# H2O Tutorials

This document contains tutorials and training materials for H2O-3. If you find any problems with the tutorial code, please open an issue in this repository. For general H2O questions, please post them to Stack Overflow using the "h2o" tag, or join the H2O Stream Google Group for questions that don't fit the Stack Overflow format.

## Finding tutorial material in GitHub

There are a number of tutorials on all sorts of topics in this repo. To help you get started, here are some of the most useful topics in both R and Python.

### R Tutorials

- Intro to H2O in R
- H2O Grid Search & Model Selection in R
- H2O Deep Learning in R
- H2O Stacked Ensembles in R
- H2O AutoML in R
- LatinR 2019 H2O Tutorial (broad overview of all the above topics)

### Python Tutorials

- Intro to H2O in Python
- H2O Grid Search & Model Selection in Python
- H2O Stacked Ensembles in Python
- H2O AutoML in Python

### Most current material

Tutorials in the master branch are intended to work with the latest stable version of H2O.
| | URL |
|---|---|
| Training material | https://github.com/h2oai/h2o-tutorials/blob/master/SUMMARY.md |
| Latest stable H2O release | http://h2o.ai/download |

| | URL |
|---|---|
| Training material | https://github.com/h2oai/h2o-tutorials/tree/master/h2o-world-2017/README.md |
| Wheeler-2 H2O release | http://h2o-release.s3.amazonaws.com/h2o/rel-wheeler/2/index.html |

| | URL |
|---|---|
| Training material | https://github.com/h2oai/h2o-tutorials/blob/h2o-world-2015-training/SUMMARY.md |
| Tibshirani-3 H2O release | http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/3/index.html |
# As current user
pip install -r requirements.txt
# As root user
sudo -E pip install -r requirements.txt
Note: If you are behind a corporate proxy, you may need to set the https_proxy environment variable accordingly.
# If you are behind a corporate proxy
export https_proxy=https://<user>:<password>@<proxy_server>:<proxy_port>
# As current user
pip install -r requirements.txt
# If you are behind a corporate proxy
export https_proxy=https://<user>:<password>@<proxy_server>:<proxy_port>
# As root user
sudo -E pip install -r requirements.txt
## R installation instructions are at http://h2o.ai/download
library(h2o)
h2o.init(nthreads=-1, max_mem_size="2G")
h2o.removeAll() ## clean slate - just in case the cluster was already running
The h2o.deeplearning function fits H2O's Deep Learning models from within R.
We can run the example from the man page using the example function, or run a longer demonstration from the h2o package using the demo function:
args(h2o.deeplearning)
help(h2o.deeplearning)
example(h2o.deeplearning)
#demo(h2o.deeplearning) #requires user interaction
While H2O Deep Learning has many parameters, it was designed to be just as easy to use as the other supervised training methods in H2O.
Early stopping, automatic data standardization, handling of categorical variables and missing values, and adaptive (per-weight) learning rates reduce the number of parameters the user has to specify.
Often, it's just the number and sizes of the hidden layers, the number of epochs, the activation function, and perhaps some regularization.
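For illustration only (not part of the original tutorial), a minimal call might look like the following sketch; train_hex and the column indices are placeholders for your own H2OFrame:
## minimal sketch: train_hex is a hypothetical H2OFrame with predictors in columns 1:2 and the response in column 3
m <- h2o.deeplearning(
  x=1:2, y=3,
  training_frame=train_hex,
  hidden=c(200,200),      ## number and sizes of hidden layers
  epochs=10,              ## passes over the training data
  activation="Rectifier", ## activation function
  l1=1e-5                 ## optional regularization
)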
setwd("~/h2o-tutorials/tutorials/deeplearning") ##For RStudio
spiral <- h2o.importFile(path = normalizePath("../data/spiral.csv"))
grid <- h2o.importFile(path = normalizePath("../data/grid.csv"))
# Define helper to plot contours
plotC <- function(name, model, data=spiral, g=grid) {
data <- as.data.frame(data) # get data from H2O into R
pred <- as.data.frame(h2o.predict(model, g))
n=0.5*(sqrt(nrow(g))-1); d <- 1.5; h <- d*(-n:n)/n
plot(data[,-3],pch=19,col=data[,3],cex=0.5,
xlim=c(-d,d),ylim=c(-d,d),main=name)
contour(h,h,z=array(ifelse(pred[,1]=="Red",0,1),
dim=c(2*n+1,2*n+1)),col="blue",lwd=2,add=T)
}
We build a few different models:
#dev.new(noRStudioGD=FALSE) #direct plotting output to a new window
par(mfrow=c(2,2)) #set up the canvas for 2x2 plots
plotC( "DL", h2o.deeplearning(1:2,3,spiral,epochs=1e3))
plotC("GBM", h2o.gbm (1:2,3,spiral))
plotC("DRF", h2o.randomForest(1:2,3,spiral))
plotC("GLM", h2o.glm (1:2,3,spiral,family="binomial"))
Let's investigate some more Deep Learning models.
First, we explore the evolution over training time (number of passes over the data), and we use checkpointing to continue training the same model:
#dev.new(noRStudioGD=FALSE) #direct plotting output to a new window
par(mfrow=c(2,2)) #set up the canvas for 2x2 plots
ep <- c(1,250,500,750)
plotC(paste0("DL ",ep[1]," epochs"),
h2o.deeplearning(1:2,3,spiral,epochs=ep[1],
model_id="dl_1"))
plotC(paste0("DL ",ep[2]," epochs"),
h2o.deeplearning(1:2,3,spiral,epochs=ep[2],
checkpoint="dl_1",model_id="dl_2"))
plotC(paste0("DL ",ep[3]," epochs"),
h2o.deeplearning(1:2,3,spiral,epochs=ep[3],
checkpoint="dl_2",model_id="dl_3"))
plotC(paste0("DL ",ep[4]," epochs"),
h2o.deeplearning(1:2,3,spiral,epochs=ep[4],
checkpoint="dl_3",model_id="dl_4"))
You can see how the network learns the structure of the spirals with enough training time.
We explore different network architectures next:
#dev.new(noRStudioGD=FALSE) #direct plotting output to a new window
par(mfrow=c(2,2)) #set up the canvas for 2x2 plots
for (hidden in list(c(11,13,17,19),c(42,42,42),c(200,200),c(1000))) {
plotC(paste0("DL hidden=",paste0(hidden, collapse="x")),
h2o.deeplearning(1:2,3,spiral,hidden=hidden,epochs=500))
}
It is clear that different configurations can achieve similar performance, and that tuning will be required for optimal performance.
Next, we compare between different activation functions, including one with 50% dropout regularization in the hidden layers:
#dev.new(noRStudioGD=FALSE) #direct plotting output to a new window
par(mfrow=c(2,2)) #set up the canvas for 2x2 plots
for (act in c("Tanh","Maxout","Rectifier","RectifierWithDropout")) {
plotC(paste0("DL ",act," activation"),
h2o.deeplearning(1:2,3,spiral,
activation=act,hidden=c(100,100),epochs=1000))
}
Clearly, the dropout rate was too high or the number of epochs was too low for the last configuration, even though that configuration often ends up performing best on larger datasets, where generalization is important.
More information about the parameters can be found in the H2O Deep Learning booklet.
df <- h2o.importFile(path = normalizePath("../data/covtype.full.csv"))
dim(df)
df
splits <- h2o.splitFrame(df, c(0.6,0.2), seed=1234)
train <- h2o.assign(splits[[1]], "train.hex") # 60%
valid <- h2o.assign(splits[[2]], "valid.hex") # 20%
test <- h2o.assign(splits[[3]], "test.hex") # 20%
Here's a scalable way to do scatter plots via binning (works for categorical and numeric columns) to get more familiar with the dataset.
#dev.new(noRStudioGD=FALSE) #direct plotting output to a new window
par(mfrow=c(1,1)) # reset canvas
plot(h2o.tabulate(df, "Elevation", "Cover_Type"))
plot(h2o.tabulate(df, "Horizontal_Distance_To_Roadways", "Cover_Type"))
plot(h2o.tabulate(df, "Soil_Type", "Cover_Type"))
plot(h2o.tabulate(df, "Horizontal_Distance_To_Roadways", "Elevation" ))
The response is the Cover_Type column, a categorical feature with 7 levels, and the Deep Learning model will be tasked to perform (multi-class) classification.
It uses the other 12 predictors of the dataset, of which 10 are numerical and 2 are categorical with a total of 44 levels.
We can expect the Deep Learning model to have 56 input neurons (after automatic one-hot encoding).
response <- "Cover_Type"
predictors <- setdiff(names(df), response)
predictors
To keep it fast, we only run for one epoch (one pass over the training data).
m1 <- h2o.deeplearning(
model_id="dl_model_first",
training_frame=train,
validation_frame=valid, ## validation dataset: used for scoring and early stopping
x=predictors,
y=response,
#activation="Rectifier", ## default
#hidden=c(200,200), ## default: 2 hidden layers with 200 neurons each
epochs=1,
variable_importances=T ## not enabled by default
)
summary(m1)
To inspect the model in Flow for more information about model building, issue a cell with the content getModel "dl_model_first" and press Ctrl-Enter.
head(as.data.frame(h2o.varimp(m1)))
m2 <- h2o.deeplearning(
model_id="dl_model_faster",
training_frame=train,
validation_frame=valid,
x=predictors,
y=response,
hidden=c(32,32,32), ## small network, runs faster
epochs=1000000, ## hopefully converges earlier...
score_validation_samples=10000, ## sample the validation dataset (faster)
stopping_rounds=2,
stopping_metric="misclassification", ## could be "MSE","logloss","r2"
stopping_tolerance=0.01
)
summary(m2)
plot(m2)
By default, H2O Deep Learning uses an adaptive learning rate (ADADELTA) with two tunable parameters, rho and epsilon, which balance the global and local search efficiencies.
rho is the similarity to prior weight updates (similar to momentum), and epsilon is a parameter that prevents the optimization from getting stuck in local optima.
Defaults are rho=0.99 and epsilon=1e-8.
For cases where convergence speed is very important, it might make sense to perform a few runs to optimize these two parameters (e.g., with rho in c(0.9,0.95,0.99,0.999) and epsilon in c(1e-10,1e-8,1e-6,1e-4)).
Of course, as always with grid searches, caution has to be applied when extrapolating grid search results to a different parameter regime (e.g., for more epochs or different layer topologies or activation functions, etc.).
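As an illustration (a hedged sketch, not from the original tutorial), such a search could be expressed as a small grid over rho and epsilon; it assumes the train, valid, predictors, and response objects defined above, keeps the default adaptive_rate=TRUE, and uses a made-up grid id:
adadelta_grid <- h2o.grid(
  algorithm="deeplearning",
  grid_id="dl_adadelta_grid",     ## hypothetical grid id
  x=predictors, y=response,
  training_frame=train,
  validation_frame=valid,
  hidden=c(32,32,32),
  epochs=1,
  hyper_params=list(
    rho=c(0.9,0.95,0.99,0.999),
    epsilon=c(1e-10,1e-8,1e-6,1e-4)
  )
)
h2o.getGrid("dl_adadelta_grid", sort_by="logloss", decreasing=FALSE)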
If adaptive_rate is disabled, several manual learning rate parameters become important: rate, rate_annealing, rate_decay, momentum_start, momentum_ramp, momentum_stable and nesterov_accelerated_gradient, the discussion of which we leave to the H2O Deep Learning booklet.
m3 <- h2o.deeplearning(
model_id="dl_model_tuned",
training_frame=train,
validation_frame=valid,
x=predictors,
y=response,
overwrite_with_best_model=F, ## Return the final model after 10 epochs, even if not the best
hidden=c(128,128,128), ## more hidden layers -> more complex interactions
epochs=10, ## to keep it short enough
score_validation_samples=10000, ## downsample validation set for faster scoring
score_duty_cycle=0.025, ## don't score more than 2.5% of the wall time
adaptive_rate=F, ## manually tuned learning rate
rate=0.01,
rate_annealing=2e-6,
momentum_start=0.2, ## manually tuned momentum
momentum_stable=0.4,
momentum_ramp=1e7,
l1=1e-5,## add some L1/L2 regularization
l2=1e-5,
max_w2=10 ## helps stability for Rectifier
)
summary(m3)
Let's compare the training error with the validation and test set errors:
h2o.performance(m3, train=T) ## sampled training data (from model building)
h2o.performance(m3, valid=T) ## sampled validation data (from model building)
h2o.performance(m3, newdata=train) ## full training data
h2o.performance(m3, newdata=valid) ## full validation data
h2o.performance(m3, newdata=test) ## full test data
To confirm that the reported confusion matrix on the validation set (here, the test set) was correct, we make a prediction on the test set and compare the confusion matrices explicitly:
pred <- h2o.predict(m3, test)
pred
test$Accuracy <- pred$predict == test$Cover_Type
1-mean(test$Accuracy)
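As an optional, hedged check (not in the original tutorial), the confusion matrix computed on the full test set should be consistent with the error rate obtained above:
h2o.confusionMatrix(m3, test)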
sampled_train=train[1:10000,]
The simplest hyperparameter search method is a brute-force scan of the full Cartesian product of all combinations specified by a grid search:
hyper_params <- list(
hidden=list(c(32,32,32),c(64,64)),
input_dropout_ratio=c(0,0.05),
rate=c(0.01,0.02),
rate_annealing=c(1e-8,1e-7,1e-6)
)
hyper_params
grid <- h2o.grid(
algorithm="deeplearning",
grid_id="dl_grid",
training_frame=sampled_train,
validation_frame=valid,
x=predictors,
y=response,
epochs=10,
stopping_metric="misclassification",
stopping_tolerance=1e-2, ## stop when misclassification does not improve by >=1% for 2 scoring events
stopping_rounds=2,
score_validation_samples=10000, ## downsample validation set for faster scoring
score_duty_cycle=0.025, ## don't score more than 2.5% of the wall time
adaptive_rate=F, ## manually tuned learning rate
momentum_start=0.5, ## manually tuned momentum
momentum_stable=0.9,
momentum_ramp=1e7,
l1=1e-5,
l2=1e-5,
activation=c("Rectifier"),
max_w2=10, ## can help improve stability for Rectifier
hyper_params=hyper_params
)
grid
Let's see which model had the lowest validation error:
grid <- h2o.getGrid("dl_grid",sort_by="err",decreasing=FALSE)
grid
## To see what other "sort_by" criteria are allowed
#grid <- h2o.getGrid("dl_grid",sort_by="wrong_thing",decreasing=FALSE)
## Sort by logloss
h2o.getGrid("dl_grid",sort_by="logloss",decreasing=FALSE)
## Find the best model and its full set of parameters
grid@summary_table[1,]
best_model <- h2o.getModel(grid@model_ids[[1]])
best_model
print(best_model@allparameters)
print(h2o.performance(best_model, valid=T))
print(h2o.logloss(best_model, valid=T))
Random hyper-parameter search builds up to max_models models with parameters drawn randomly from user-specified distributions (here, uniform).
For this example, we use the adaptive learning rate and focus on tuning the network architecture and the regularization parameters.
We also let the grid search stop automatically once the performance at the top of the leaderboard doesn't change much anymore, i.e., once the search has converged.
hyper_params <- list(
activation=c("Rectifier","Tanh","Maxout","RectifierWithDropout","TanhWithDropout","MaxoutWithDropout"),
hidden=list(c(20,20),c(50,50),c(30,30,30),c(25,25,25,25)),
input_dropout_ratio=c(0,0.05),
l1=seq(0,1e-4,1e-6),
l2=seq(0,1e-4,1e-6)
)
hyper_params
## Stop once the top 5 models are within 1% of each other (i.e., the windowed average varies less than 1%)
search_criteria = list(strategy = "RandomDiscrete", max_runtime_secs = 360, max_models = 100, seed=1234567, stopping_rounds=5, stopping_tolerance=1e-2)
dl_random_grid <- h2o.grid(
algorithm="deeplearning",
grid_id = "dl_grid_random",
training_frame=sampled_train,
validation_frame=valid,
x=predictors,
y=response,
epochs=1,
stopping_metric="logloss",
stopping_tolerance=1e-2, ## stop when logloss does not improve by >=1% for 2 scoring events
stopping_rounds=2,
score_validation_samples=10000, ## downsample validation set for faster scoring
score_duty_cycle=0.025, ## don't score more than 2.5% of the wall time
max_w2=10, ## can help improve stability for Rectifier
hyper_params = hyper_params,
search_criteria = search_criteria
)
grid <- h2o.getGrid("dl_grid_random",sort_by="logloss",decreasing=FALSE)
grid
grid@summary_table[1,]
best_model <- h2o.getModel(grid@model_ids[[1]]) ## model with lowest logloss
best_model
Let's look at the model with the lowest validation misclassification rate:
grid <- h2o.getGrid("dl_grid",sort_by="err",decreasing=FALSE)
best_model <- h2o.getModel(grid@model_ids[[1]]) ## model with lowest classification error (on validation, since it was available during training)
h2o.confusionMatrix(best_model,valid=T)
best_params <- best_model@allparameters
best_params$activation
best_params$hidden
best_params$input_dropout_ratio
best_params$l1
best_params$l2
Since only the following parameters can be modified between checkpoint restarts (epochs, l1, l2, max_w2, score_interval, train_samples_per_iteration, input_dropout_ratio, hidden_dropout_ratios, score_duty_cycle, classification_stop, regression_stop, variable_importances, force_load_balance), it is best to specify as many parameters as possible explicitly.
max_epochs <- 12 ## Add two more epochs
m_cont <- h2o.deeplearning(
model_id="dl_model_tuned_continued",
checkpoint="dl_model_tuned",
training_frame=train,
validation_frame=valid,
x=predictors,
y=response,
hidden=c(128,128,128), ## more hidden layers -> more complex interactions
epochs=max_epochs, ## hopefully long enough to converge (otherwise restart again)
stopping_metric="logloss", ## logloss is directly optimized by Deep Learning
stopping_tolerance=1e-2, ## stop when validation logloss does not improve by >=1% for 2 scoring events
stopping_rounds=2,
score_validation_samples=10000, ## downsample validation set for faster scoring
score_duty_cycle=0.025, ## don't score more than 2.5% of the wall time
adaptive_rate=F, ## manually tuned learning rate
rate=0.01,
rate_annealing=2e-6,
momentum_start=0.2, ## manually tuned momentum
momentum_stable=0.4,
momentum_ramp=1e7,
l1=1e-5,## add some L1/L2 regularization
l2=1e-5,
max_w2=10 ## helps stability for Rectifier
)
summary(m_cont)
plot(m_cont)
Once we are satisfied with the results, we can save the model to disk (on the cluster).
In this example, we store the model in a directory called mybest_deeplearning_covtype_model, which will be created for us since force=TRUE.
path <- h2o.saveModel(m_cont,
path="./mybest_deeplearning_covtype_model", force=TRUE)
It can be loaded later with the following command:
print(path)
m_loaded <- h2o.loadModel(path)
summary(m_loaded)
This model is fully functional and can be inspected, restarted, or used to score a dataset, etc.
Note that binary compatibility between H2O versions is currently not guaranteed.
For N-fold cross-validation, specify nfolds>1 instead of (or in addition to) a validation frame, and N+1 models will be built: 1 model on the full training data, and N models with 1/N-th of the data held out each (there are different holdout strategies).
Those N models then score on the held-out data, and their combined predictions on the full training data are scored to get the cross-validation metrics.
dlmodel <- h2o.deeplearning(
x=predictors,
y=response,
training_frame=train,
hidden=c(10,10),
epochs=1,
nfolds=5,
fold_assignment="Modulo" # can be "AUTO", "Modulo", "Random" or "Stratified"
)
dlmodel
N-fold cross-validation is especially useful with early stopping, as the main model will pick the ideal number of epochs from the convergence behavior of the cross-validation models.
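A hedged sketch (not from the original tutorial) of combining cross-validation with metric-based early stopping, reusing the objects defined above:
dlmodel_cv <- h2o.deeplearning(
  x=predictors, y=response,
  training_frame=train,
  hidden=c(10,10),
  epochs=100,
  nfolds=5,
  stopping_metric="logloss",
  stopping_rounds=2,
  stopping_tolerance=0.01
)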
train$bin_response <- ifelse(train[,response]=="class_1", 0, 1)
Let's build a quick model and inspect the model:
dlmodel <- h2o.deeplearning(
x=predictors,
y="bin_response",
training_frame=train,
hidden=c(10,10),
epochs=0.1
)
summary(dlmodel)
Instead of a binary classification model, we find a regression model (H2ORegressionModel) that contains only 1 output neuron (instead of 2).
The reason is that the response was a numerical feature (ordinal numbers 0 and 1), and H2O Deep Learning was run with distribution=AUTO, which defaulted to a Gaussian regression problem for a real-valued response.
H2O Deep Learning supports regression for distributions other than Gaussian, such as Poisson, Gamma, Tweedie and Laplace.
It also supports Huber loss and per-row offsets specified via an offset_column.
We refer to our H2O Deep Learning regression code examples for more information.
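As a hedged sketch (not from the original tutorial), an explicit non-Gaussian regression on the numeric 0/1 response created above could look like this:
dl_laplace <- h2o.deeplearning(
  x=predictors,
  y="bin_response",
  training_frame=train,
  hidden=c(10,10),
  epochs=0.1,
  distribution="laplace"   ## Laplace instead of the default Gaussian
)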
To perform classification, the response must first be turned into a categorical (factor) feature:
train$bin_response <- as.factor(train$bin_response) ##make categorical
dlmodel <- h2o.deeplearning(
x=predictors,
y="bin_response",
training_frame=train,
hidden=c(10,10),
epochs=0.1
#balance_classes=T ## enable this for high class imbalance
)
summary(dlmodel) ## Now the model metrics contain AUC for binary classification
plot(h2o.performance(dlmodel)) ## display ROC curve
Now the model performs (binary) classification, and has multiple (2) output neurons.
One available activation function is Tanh, a scaled and shifted variant of the sigmoid that is symmetric around 0.
Since its output values are bounded by -1..1, the stability of the neural network is rarely endangered.
However, the derivative of the tanh function is always non-zero, so back-propagation (training) of the weights is more computationally expensive than for rectified linear units (Rectifier), which compute max(0,x) and have a vanishing gradient for x<=0. This leads to much faster training for large networks and is often the fastest path to accuracy on larger problems.
In case you encounter instabilities with the Rectifier (in which case model building is automatically aborted), try a limited value to re-scale the weights, e.g. max_w2=10.
The Maxout activation function is computationally more expensive, but can lead to higher accuracy.
It is a generalized version of the Rectifier with two non-zero channels.
In practice, the Rectifier (and RectifierWithDropout, see below) is the most versatile and performant option for most problems.
L1 and L2 regularization can be specified via the l1 and l2 parameters.
Intuition: L1 lets only strong weights survive (constant pulling force towards zero), while L2 prevents any single weight from getting too big.
Dropout has recently been introduced as a powerful generalization technique, and is available as a parameter per layer, including the input layer.
input_dropout_ratio controls the fraction of input-layer neurons that are randomly dropped (set to zero), while hidden_dropout_ratios are specified for each hidden layer.
The former controls overfitting with respect to the input data (useful for high-dimensional noisy data), while the latter controls overfitting of the learned features.
Note that hidden_dropout_ratios require the activation function to end with ...WithDropout.
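A hedged sketch (not from the original tutorial) of combining input and hidden dropout; note the ...WithDropout activation and one ratio per hidden layer:
dl_dropout <- h2o.deeplearning(
  x=predictors, y=response,
  training_frame=train,
  validation_frame=valid,
  activation="RectifierWithDropout",
  hidden=c(128,128),
  input_dropout_ratio=0.1,          ## drop 10% of the input neurons
  hidden_dropout_ratios=c(0.3,0.3), ## one ratio per hidden layer
  epochs=10
)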
Early stopping is triggered when the stopping_metric does not improve by at least stopping_tolerance (0.01 means 1% improvement) for stopping_rounds consecutive scoring events on the training (or validation) data.
By default, overwrite_with_best_model is enabled, so the model returned after training for the specified number of epochs (or after stopping early due to convergence) is the model with the best training set error (according to the metric specified by stopping_metric), or, if a validation set is provided, the lowest validation set error.
Note that the training or validation set errors can be based on a subset of the training or validation data, depending on the values of score_validation_samples or score_training_samples; see below.
For early stopping on a predefined error rate on the training data (accuracy for classification or MSE for regression), specify classification_stop or regression_stop.
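As a hedged sketch (not from the original tutorial), metric-based early stopping and a predefined training error target can be combined:
dl_earlystop <- h2o.deeplearning(
  x=predictors, y=response,
  training_frame=train,
  validation_frame=valid,
  hidden=c(32,32,32),
  epochs=100,
  stopping_metric="misclassification",
  stopping_rounds=2,
  stopping_tolerance=0.01,        ## stop when <1% improvement for 2 scoring events
  classification_stop=0.15        ## also stop once training error drops below 15%
)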
The parameter train_samples_per_iteration matters especially in multi-node operation.
It controls the number of rows trained on for each MapReduce iteration.
Depending on the value selected, one MapReduce pass can sample observations, and multiple such passes are needed to train for one epoch.
All H2O compute nodes then communicate to agree on the best model coefficients (weights/biases) so far, and the model may then be scored (controlled by other parameters below).
The default value of -2 indicates auto-tuning, which attempts to keep the communication overhead at 5% of the total runtime.
The parameter target_ratio_comm_to_comp controls this ratio.
This parameter is explained in more detail in the H2O Deep Learning booklet.
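For illustration (a hedged sketch, not from the original tutorial), the communication/computation trade-off can be set explicitly:
dl_multinode <- h2o.deeplearning(
  x=predictors, y=response,
  training_frame=train,
  hidden=c(64,64),
  epochs=10,
  train_samples_per_iteration=-2,  ## -2 = auto-tuning (default)
  target_ratio_comm_to_comp=0.05   ## aim for ~5% communication overhead
)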
Categorical features are automatically one-hot encoded into input neurons (e.g., for Tanh networks).
If variable importances are computed, it is recommended to turn on use_all_factor_levels (K input neurons for K levels).
The experimental option max_categorical_features
uses feature hashing to reduce the number of input neurons via the hash trick at the expense of hash collisions and reduced accuracy.
Another way to reduce the dimensionality of the (categorical) features is to use h2o.glrm(); we refer to the GLRM tutorial for more details.
For input data with many zero values, consider enabling the sparse option.
This will result in the input not being standardized (0 mean, 1 variance), but only de-scaled (1 variance), so that 0 values remain 0, leading to more efficient back-propagation.
Sparsity is also a reason why CPU implementations can be faster than GPU implementations, because they can take advantage of if/else statements more effectively.
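A hedged sketch (not from the original tutorial) of enabling sparse handling:
dl_sparse <- h2o.deeplearning(
  x=predictors, y=response,
  training_frame=train,
  hidden=c(32,32),
  epochs=1,
  sparse=TRUE   ## efficient handling of input data with many zeros
)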
Missing values are imputed with the mean by default; alternatively, you can use the h2o.impute function to do your own mean imputation.
In addition to Gaussian distributions and Squared loss, H2O Deep Learning supports Poisson, Gamma, Tweedie and Laplace distributions.
It also supports Absolute and Huber loss and per-row offsets specified via an offset_column.
Observation weights are supported via a user-specified weights_column.
We refer to our H2O Deep Learning R test code examples for more information.
The model's weight matrices and bias vectors can be stored by enabling export_weights_and_biases, and they can be accessed as follows:
iris_dl <- h2o.deeplearning(1:4,5,as.h2o(iris),
export_weights_and_biases=T)
h2o.weights(iris_dl, matrix_id=1)
h2o.weights(iris_dl, matrix_id=2)
h2o.weights(iris_dl, matrix_id=3)
h2o.biases(iris_dl, vector_id=1)
h2o.biases(iris_dl, vector_id=2)
h2o.biases(iris_dl, vector_id=3)
#plot weights connecting `Sepal.Length` to first hidden neurons
plot(as.data.frame(h2o.weights(iris_dl, matrix_id=1))[,1])
To get reproducible results (for small datasets and testing purposes only), set reproducible=T and set seed=1337 (pick any integer).
This will not work for big data for technical reasons, and is probably also not desired because of the significant slowdown (it runs on only 1 core).
The time spent scoring during training can be reduced with score_validation_samples (defaults to 0: all) or score_training_samples (defaults to 10,000 rows, since the training error is only used for early stopping and monitoring).
For large datasets, Deep Learning can automatically sample the validation set to avoid spending too much time in scoring during training, especially since scoring results are not currently displayed in the model returned to R.
Note that the default value of score_duty_cycle=0.1 limits the amount of time spent in scoring to 10%, so a large number of scoring samples won't slow down overall training progress too much, but it will always score once after the first MapReduce iteration, and once at the end of training.
Stratified sampling of the validation dataset can help with scoring on datasets with class imbalance.
Note that this option also requires balance_classes to be enabled (used to over/under-sample the training dataset, based on the maximum relative size of the resulting training dataset, max_after_balance_size):
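A hedged sketch (not from the original tutorial) of such a call:
dl_stratified <- h2o.deeplearning(
  x=predictors, y=response,
  training_frame=train,
  validation_frame=valid,
  hidden=c(32,32),
  epochs=1,
  score_validation_samples=10000,
  score_validation_sampling="Stratified", ## stratified sampling of the validation set
  balance_classes=TRUE,                   ## over/under-sample the training data
  max_after_balance_size=2
)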
h2o.shutdown(prompt=FALSE)
Before running this tutorial, setwd() to the location of this script.
h2o.init() starts H2O in R's current working directory.
h2o.importFile() looks for files from the perspective of where H2O was started.
More examples and explanations can be found in our H2O GLM booklet and on our H2O Github Repository.
## R installation instructions are at http://h2o.ai/download
library(h2o)
h2o.init(nthreads=-1, max_mem_size="2G")
h2o.removeAll() ## clean slate - just in case the cluster was already running
D = h2o.importFile(path = normalizePath("../data/covtype.full.csv"))
h2o.summary(D)
We have 11 numeric and two categorical features.
Response is "Cover_Type" and has 7 classes.
Let's split the data into Train/Test/Validation with train having 70% and Test and Validation 15% each:
data = h2o.splitFrame(D, ratios = c(.7,.15), destination_frames = c("train","test","valid"))
names(data) <- c("Train","Test","Valid")
# define the response and predictor columns
y = "Cover_Type"
x = setdiff(names(D), y)
m1 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y, family = 'multinomial', solver = 'L_BFGS')
h2o.confusionMatrix(m1, valid=TRUE)
The model predicts only the majority class, so it's not useful at all! Maybe we regularized it too much; let's try again without regularization:
m2 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y,family='multinomial',solver='L_BFGS', lambda = 0)
h2o.confusionMatrix(m2, valid=FALSE) # get confusion matrix in the training data
h2o.confusionMatrix(m2, valid=TRUE) # get confusion matrix in the validation data
No overfitting (train and validation performance are similar); regularization is not needed in this case.
This model is actually useful.
It got 28% classification error, down from 51% obtained by predicting majority class only.
Let's take class_1 and class_2 (the two majority classes) and build a binomial model deciding between them.
D_binomial = D[D$Cover_Type %in% c("class_1","class_2"),]
h2o.setLevels(D_binomial$Cover_Type,c("class_1","class_2"))
# split to train/test/validation again
data_binomial = h2o.splitFrame(D_binomial, ratios = c(.7,.15), destination_frames = c("train_b","test_b","valid_b"))
names(data_binomial) <- c("Train","Test","Valid")
names(data_binomial)
We can run a binomial model now:
m_binomial = h2o.glm(training_frame = data_binomial$Train, validation_frame = data_binomial$Valid, x = x, y = y, family='binomial',lambda=0)
h2o.confusionMatrix(m_binomial, valid = TRUE)
The output for a binomial problem is slightly different from multinomial.
The confusion matrix now has a threshold attached to it.
The model produces probabilities of class_1 and class_2, similarly to the multinomial example earlier.
However, this time we only have two classes and we can tune the classification to our needs.
The classification errors in the binomial case have a particular meaning: we call them false positives and false negatives.
In reality, each can have a different cost associated with it, so we want to tune our classifier accordingly.
The common way to evaluate a binary classifier performance is to look at its ROC curve.
The ROC curve plots the true positive rate versus false positive rate.
We can plot it from the H2O model output:
fpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$fpr
tpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$tpr
fpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$fpr
tpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$tpr
plot(fpr,tpr, type='l')
title('AUC')
lines(fpr_val,tpr_val,type='l',col='red')
legend("bottomright",c("Train", "Validation"),col=c("black","red"),lty=c(1,1),lwd=c(3,3))
The area under the ROC curve (AUC) is a common "good fit" metric for binary classifiers.
For this example, the results were:
h2o.auc(m_binomial,valid=FALSE) # on train
h2o.auc(m_binomial,valid=TRUE) # on validation
The default confusion matrix is computed at thresholds that optimize the F1 score.
We can choose different thresholds - the H2O output shows optimal thresholds for some common metrics.
m_binomial@model$training_metrics@metrics$max_criteria_and_metric_scores
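As a hedged sketch (not from the original tutorial), the confusion matrix can also be evaluated at a custom threshold, for instance when false negatives are costlier than false positives:
perf_valid <- h2o.performance(m_binomial, valid=TRUE)
h2o.confusionMatrix(perf_valid, thresholds=0.3)  ## 0.3 is an arbitrary example threshold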
The model we just built gets 23% classification error at the F1-optimizing threshold, so there is still room for improvement.
Let's add some features:

- There are 11 numerical predictors in the dataset; we will cut them into intervals and add a categorical variable for each.
- We can add interaction terms capturing interactions between categorical variables.
Let's make a convenience function to cut a column into intervals, working on all three of our datasets (Train/Validation/Test).
We'll use h2o.hist to determine the interval boundaries (but there are many more ways to do that!) on the Train set.
cut_column <- function(data, col) {
# need lower/upper bound due to h2o.cut behavior (points < the first break or > the last break are replaced with missing value)
min_val = min(data$Train[,col])-1
max_val = max(data$Train[,col])+1
x = h2o.hist(data$Train[, col])
# use only the breaks with enough support
breaks = x$breaks[which(x$counts > 1000)]
# assign level names
lvls = c("min",paste("i_",breaks[2:length(breaks)-1],sep=""),"max")
col_cut <-
Now let's make a convenience function generating interaction terms on all three of our datasets.
We'll use h2o.interaction:
interactions
Finally, let's wrap addition of the features into a separate function call, as we will use it again later.
We'll add intervals for each numeric column and interactions between each pair of binary columns.
# add features to our cover type example
# let's cut all the numerical columns into intervals and add interactions between categorical terms
add_features
Now we generate new features and add them to the dataset.
We'll also need to generate column names again, as we added more columns:
# Add Features
data_binomial_ext
Let's build the model! We should add some regularization this time because we added correlated variables, so let's try the default:
m_binomial_1_ext = try(h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial'))
Oops, it doesn't run: we now have more features than the default method can solve with 2GB of RAM.
Let's try L-BFGS instead.
m_binomial_1_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', solver='L_BFGS')
h2o.confusionMatrix(m_binomial_1_ext)
h2o.auc(m_binomial_1_ext,valid=TRUE)
Not much better, maybe too much regularization? Let's pick a smaller lambda and try again.
m_binomial_2_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', solver='L_BFGS', lambda=1e-4)
h2o.confusionMatrix(m_binomial_2_ext, valid=TRUE)
h2o.auc(m_binomial_2_ext,valid=TRUE)
Much better: we got an AUC of 0.91 and a classification error of 0.180838.
We picked our regularization strength arbitrarily.
Also, we used only the l2 penalty, but we added a lot of extra features, some of which may be useless.
Maybe we can do better with an l1 penalty.
So now we want to run a lambda search to find the optimal penalty strength, and we want a non-zero l1 penalty to get a sparse solution.
We'll use the IRLSM solver this time, as it does much better with lambda search and the l1 penalty.
Recall we were not able to use it before.
We can use it now because the lambda search will filter out a large portion of the inactive (coefficient==0) predictors.
m_binomial_3_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', lambda_search=TRUE)
h2o.confusionMatrix(m_binomial_3_ext, valid=TRUE)
h2o.auc(m_binomial_3_ext,valid=TRUE)
Better yet, we have 17% error and we used only 3000 out of 7000 features.
Ok, our new features improved the binomial model significantly, so let's revisit our former multinomial model and see if they make a difference there (they should!):
# Multinomial Model 2
# let's revisit the multinomial case with our new features
data_ext
Improved considerably, 21% instead of 28%.
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "2G")
gait.hex <- h2o.importFile(path = normalizePath("../data/subject01_walk1.csv"), destination_frame = "gait.hex")
dim(gait.hex)
summary(gait.hex)
gait.glrm <- h2o.glrm(training_frame = gait.hex, cols = 2:ncol(gait.hex), k = 10, loss = "Quadratic",
regularization_x = "None", regularization_y = "None", max_iterations = 1000)
plot(gait.glrm)
gait.y <- gait.glrm@model$archetypes
gait.y.mat <- as.matrix(gait.y)
x_coords <- seq(1, ncol(gait.y), by = 3)
y_coords <- seq(2, ncol(gait.y), by = 3)
feat_nams <- sapply(colnames(gait.y), function(nam) { substr(nam, 1, nchar(nam)-1) })
feat_nams <- as.character(feat_nams[x_coords])
for(k in 1:10) {
plot(gait.y.mat[k,x_coords], gait.y.mat[k,y_coords], xlab = "X-Coordinate Weight", ylab = "Y-Coordinate Weight", main = paste("Feature Weights of Archetype", k), col = "blue", pch = 19, lty = "solid")
text(gait.y.mat[k,x_coords], gait.y.mat[k,y_coords], labels = feat_nams, cex = 0.7, pos = 3)
cat("Press [Enter] to continue")
line <- readline()
}
gait.x <- h2o.getFrame(gait.glrm@model$representation_name)
time.df <- as.data.frame(gait.hex$Time[1:150])[,1]
gait.x.df <- as.data.frame(gait.x[1:150,])
matplot(time.df, gait.x.df, xlab = "Time", ylab = "Archetypal Projection", main = "Archetypes over Time", type = "l", lty = 1, col = 1:5)
legend("topright", legend = colnames(gait.x.df), col = 1:5, pch = 1)
gait.pred <- predict(gait.glrm, gait.hex)
head(gait.pred)
lacro.df <- as.data.frame(gait.hex$L.Acromium.X[1:150])
lacro.pred.df <- as.data.frame(gait.pred$reconstr_L.Acromium.X[1:150])
matplot(time.df, cbind(lacro.df, lacro.pred.df), xlab = "Time", ylab = "X-Coordinate of Left Acromium", main = "Position of Left Acromium over Time", type = "l", lty = 1, col = c(1,4))
legend("topright", legend = c("Original", "Reconstructed"), col = c(1,4), pch = 1)
gait.miss <- h2o.importFile(path = normalizePath("../data/subject01_walk1_miss15.csv"), destination_frame = "gait.miss")
dim(gait.miss)
summary(gait.miss)
sum(is.na(gait.miss))
gait.glrm2 <- h2o.glrm(training_frame = gait.miss, validation_frame = gait.hex, cols = 2:ncol(gait.miss), k = 10, init = "SVD", svd_method = "GramSVD",
loss = "Quadratic", regularization_x = "None", regularization_y = "None", max_iterations = 2000, min_step_size = 1e-6)
plot(gait.glrm2)
gait.pred2 <- predict(gait.glrm2, gait.miss)
head(gait.pred2)
sum(is.na(gait.pred2))
lacro.pred.df2 <- as.data.frame(gait.pred2$reconstr_L.Acromium.X[1:150])
matplot(time.df, cbind(lacro.df, lacro.pred.df2), xlab = "Time", ylab = "X-Coordinate of Left Acromium", main = "Position of Left Acromium over Time", type = "l", lty = 1, col = c(1,4))
legend("topright", legend = c("Original", "Imputed"), col = c(1,4), pch = 1)
lacro.miss.df <- as.data.frame(gait.miss$L.Acromium.X[1:150])
idx_miss <- which(is.na(lacro.miss.df))
points(time.df[idx_miss], lacro.df[idx_miss,1], col = 2, pch = 4, lty = 2)
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "2G")
acs_orig <- h2o.importFile(path = "../data/ACS_13_5YR_DP02_cleaned.zip", col.types = c("enum", rep("numeric", 149)))
acs_zcta_col <- acs_orig$ZCTA5
acs_full <- acs_orig[,-which(colnames(acs_orig) == "ZCTA5")]
dim(acs_full)
summary(acs_full)
acs_model <- h2o.glrm(training_frame = acs_full, k = 10, transform = "STANDARDIZE",
loss = "Quadratic", regularization_x = "Quadratic",
regularization_y = "L1", max_iterations = 100, gamma_x = 0.25, gamma_y = 0.5)
plot(acs_model)
zcta_arch_x <- h2o.getFrame(acs_model@model$representation_name)
head(zcta_arch_x)
idx <- ((acs_zcta_col == "10065") | # Manhattan, NY (Upper East Side)
(acs_zcta_col == "11219") | # Manhattan, NY (East Harlem)
(acs_zcta_col == "66753") | # McCune, KS
(acs_zcta_col == "84104") | # Salt Lake City, UT
(acs_zcta_col == "94086") | # Sunnyvale, CA
(acs_zcta_col == "95014")) # Cupertino, CA
city_arch <- as.data.frame(zcta_arch_x[idx,1:2])
xeps <- (max(city_arch[,1]) - min(city_arch[,1])) / 10
yeps <- (max(city_arch[,2]) - min(city_arch[,2])) / 10
xlims <- c(min(city_arch[,1]) - xeps, max(city_arch[,1]) + xeps)
ylims <- c(min(city_arch[,2]) - yeps, max(city_arch[,2]) + yeps)
plot(city_arch[,1], city_arch[,2], xlim = xlims, ylim = ylims, xlab = "First Archetype", ylab = "Second Archetype", main = "Archetype Representation of Zip Code Tabulation Areas")
text(city_arch[,1], city_arch[,2], labels = c("Upper East Side", "East Harlem", "McCune", "Salt Lake City", "Sunnyvale", "Cupertino"), pos = 1)
whd_zcta <- h2o.importFile(path = "../data/whd_zcta_cleaned.zip", col.types = c(rep("enum", 7), rep("numeric", 97)))
dim(whd_zcta)
summary(whd_zcta)
split <- h2o.runif(whd_zcta)
train <- whd_zcta[split <= 0.8,]
test <- whd_zcta[split > 0.8,]
myY <- "flsa_repeat_violator"
myX <- setdiff(5:ncol(train), which(colnames(train) == myY))
orig_time <- system.time(dl_orig <- h2o.deeplearning(x = myX, y = myY, training_frame = train,
validation_frame = test, distribution = "multinomial",
epochs = 0.1, hidden = c(50,50,50)))
zcta_arch_x$zcta5_cd <- acs_zcta_col
whd_arch <- h2o.merge(whd_zcta, zcta_arch_x, all.x = TRUE, all.y = FALSE)
whd_arch$zcta5_cd <- NULL
summary(whd_arch)
train_mod <- whd_arch[split <= 0.8,]
test_mod <- whd_arch[split > 0.8,]
myX <- setdiff(5:ncol(train_mod), which(colnames(train_mod) == myY))
mod_time <- system.time(dl_mod <- h2o.deeplearning(x = myX, y = myY, training_frame = train_mod,
validation_frame = test_mod, distribution = "multinomial",
epochs = 0.1, hidden = c(50,50,50)))
colnames(acs_orig)[1] <- "zcta5_cd"
whd_acs <- h2o.merge(whd_zcta, acs_orig, all.x = TRUE, all.y = FALSE)
whd_acs$zcta5_cd <- NULL
summary(whd_acs)
train_comb <- whd_acs[split <= 0.8,]
test_comb <- whd_acs[split > 0.8,]
myX <- setdiff(5:ncol(train_comb), which(colnames(train_comb) == myY))
comb_time <- system.time(dl_comb <- h2o.deeplearning(x = myX, y = myY, training_frame = train_comb,
validation_frame = test_comb, distribution = "multinomial",
epochs = 0.1, hidden = c(50,50,50)))
data.frame(original = c(orig_time[3], h2o.logloss(dl_orig, train = TRUE), h2o.logloss(dl_orig, valid = TRUE)),
reduced = c(mod_time[3], h2o.logloss(dl_mod, train = TRUE), h2o.logloss(dl_mod, valid = TRUE)),
combined = c(comb_time[3], h2o.logloss(dl_comb, train = TRUE), h2o.logloss(dl_comb, valid = TRUE)),
row.names = c("runtime", "train_logloss", "test_logloss"))
ProductId, UserId, Time, etc.
Analyze our initial model: AUC, confusion matrix, variable importance, partial dependence plots
> library(h2o)
> h2o.init(nthreads = -1)
> # Download the data into the pums2013 directory if necessary.
> pumsdir <- "pums2013"
> if (! file.exists(pumsdir)) {
> dir.create(pumsdir)
> }
> trainfile <- file.path(pumsdir, "adult_2013_train.csv.gz")
> if (! file.exists(trainfile)) {
> download.file("http://h2o-training.s3.amazonaws.com/pums2013/adult_2013_train.csv.gz", trainfile)
> }
> testfile <- file.path(pumsdir, "adult_2013_test.csv.gz")
> if (! file.exists(testfile)) {
> download.file("http://h2o-training.s3.amazonaws.com/pums2013/adult_2013_test.csv.gz", testfile)
> }
Load the datasets (change the directory to reflect where you stored these files):
> adult_2013_train <- h2o.importFile(trainfile, destination_frame = "adult_2013_train")
> adult_2013_test <- h2o.importFile(testfile, destination_frame = "adult_2013_test")
Looking at the data, we can see that 8 columns are using integer codes to represent different categorical levels.
Let's tell H2O to treat those columns as factors.
> actual_log_wagp <- h2o.assign(adult_2013_test[, "LOG_WAGP"], key = "actual_log_wagp")
> for (j in c("COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX", "POBP")) {
> adult_2013_train[[j]] <- as.factor(adult_2013_train[[j]])
> adult_2013_test[[j]] <- as.factor(adult_2013_test[[j]])
> }
> predset <- c("RELP", "SCHL", "COW", "MAR", "INDP", "RAC1P", "SEX", "POBP", "AGEP", "WKHP", "LOG_CAPGAIN", "LOG_CAPLOSS")
> log_wagp_gbm <- h2o.gbm(x = predset,
y = "LOG_WAGP",
training_frame = adult_2013_train,
model_id = "GBMModel",
distribution = "gaussian",
max_depth = 5,
ntrees = 110,
validation_frame = adult_2013_test)
> log_wagp_gbm
Model Details:
==============
H2ORegressionModel: gbm
Model ID: GBMModel
Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1 110.000000 111698.000000 5.000000 5.000000 5.00000 14.000000 32.000000 27.93636
H2ORegressionMetrics: gbm
** Reported on training data.
**
MSE: 0.4626122
R2 : 0.7362828
Mean Residual Deviance : 0.4626122
H2ORegressionMetrics: gbm
** Reported on validation data.
**
MSE: 0.6605266
R2 : 0.6290677
Mean Residual Deviance : 0.6605266
Next, we download the model as a Java POJO into a directory called generated_model.
> tmpdir_name <- "generated_model"
> dir.create(tmpdir_name)
> h2o.download_pojo(log_wagp_gbm, tmpdir_name)
[1] "POJO written to: generated_model/GBMModel.java"
At this point, the Java POJO is available for scoring data outside of H2O.
As the last step in R, let's take a look at the scores this model gives on the test data set.
We will use these to confirm the results in Hive.
> h2o.predict(log_wagp_gbm, adult_2013_test)
H2OFrame with 37345 rows and 1 column
First 10 rows:
predict
1 10.432787
2 10.244159
3 10.432688
4 9.604912
5 10.285979
6 10.356251
7 10.261413
8 10.046026
9 10.766078
10 9.502004
$ cp generated_model/h2o-genmodel.jar localjars
$ cp generated_model/GBMModel.java src/main/java/ai/h2o/hive/udf/GBMModel.java
At the top of GBMModel.java, add:
package ai.h2o.hive.udf;
$ hadoop version
$ hive --version
Plug these into the <properties> section of the pom.xml file.
Currently, the configuration is set for pulling the necessary dependencies for Hortonworks.
For other Hadoop distributions, you will also need to update the <repositories>
section to reflect the respective repositories (a commented-out link to a Cloudera repository is included).
Caution: This tutorial was written using Maven 3.0.4. Older 2.x versions of Maven may not work.
$ mvn compile
$ mvn package
As with most Maven builds, the first run will probably seem like it is downloading the entire Internet.
It is just grabbing the needed compile dependencies.
In the end, this process should create the file target/ScoreData-1.0-SNAPSHOT.jar.
As a part of the build process, Maven is running a unit test on the code.
If you are looking to use this template for your own models, you either need to modify the test to reflect your own data, or run Maven without the test (mvn package -Dmaven.test.skip=true).
$ hadoop fs -mkdir hdfs://my-name-node:/user/myhomedir/UDFtest
$ hadoop fs -put adult_2013_test.csv.gz hdfs://my-name-node:/user/myhomedir/UDFtest/.
$ hive
Here we mark the table as EXTERNAL
so that Hive doesn't make a copy of the file needlessly.
We also tell Hive to ignore the first line, since it contains the column names.
> CREATE EXTERNAL TABLE adult_data_set (AGEP INT, COW STRING, SCHL STRING, MAR STRING, INDP STRING, RELP STRING, RAC1P STRING, SEX STRING, WKHP INT, POBP STRING, WAGP INT, CAPGAIN INT, CAPLOSS INT, LOG_CAPGAIN DOUBLE, LOG_CAPLOSS DOUBLE, LOG_WAGP DOUBLE, CENT_WAGP STRING, TOP_WAG2P INT, RELP_SCHL STRING) COMMENT 'PUMS 2013 test data' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE location '/user/myhomedir/UDFtest' tblproperties ("skip.header.line.count"="1");
> ANALYZE TABLE adult_data_set COMPUTE STATISTICS;
$ hadoop fs -put localjars/h2o-genmodel.jar hdfs://my-name-node:/user/myhomedir/
$ hadoop fs -put target/ScoreData-1.0-SNAPSHOT.jar hdfs://my-name-node:/user/myhomedir/
$ hive
Note that for correct class loading, you will need to load the h2o-genmodel.jar before the ScoreData jar file.
> ADD JAR h2o-genmodel.jar;
> ADD JAR ScoreData-1.0-SNAPSHOT.jar;
> CREATE TEMPORARY FUNCTION scoredata AS 'ai.h2o.hive.udf.ScoreDataUDF';
Keep in mind that your UDF is only loaded in Hive for as long as you are using it.
If you quit; and then start Hive again, you will have to re-enter the last three lines.
hive> SELECT scoredata(AGEP, COW, SCHL, MAR, INDP, RELP, RAC1P, SEX, WKHP, POBP, LOG_CAPGAIN, LOG_CAPLOSS) FROM adult_data_set LIMIT 10;
OK
10.476669
10.201586
10.463915
9.709603
10.175115
10.3576145
10.256757
10.050725
10.759903
9.316141
Time taken: 0.063 seconds, Fetched: 10 row(s)
Ideally, one could simply issue SELECT scoredata(*) FROM adult_data_set; and let the UDF pick out the relevant fields by name.
While the H2O POJO does have utility functions for this, Hive doesn't provide UDF writers the names of the fields (as mentioned in this Hive feature request) from which the arguments originate.
Finally, as written, the UDF only returns a single prediction value.
The H2O POJO actually returns an array of float values.
The first value is the main prediction and the remaining values hold probability distributions for classifiers.
This code can easily be expanded to return all values if desired.
The annotations below determine what is shown for DESCRIBE scoredata or DESCRIBE EXTENDED scoredata.
@UDFType(deterministic = true, stateful = false)
@Description(name="scoredata", value="_FUNC_(*) - Returns a score for the given row",
extended="Example:\n"+"> SELECT scoredata(*) FROM target_data;")
Rather than extend the plain UDF class, this template extends GenericUDF.
The plain UDF requires that you hard code each of your input variables.
This is fine for most UDFs, but for a scoring function the number of columns used may be large enough to make this cumbersome.
Note the declaration of an array to hold ObjectInspectors for each argument, as well as the instantiation of the model POJO.
class ScoreDataUDF extends GenericUDF {
private PrimitiveObjectInspector[] inFieldOI;
GBMModel p = new GBMModel();
@Override
public String getDisplayString(String[] args) {
return "scoredata("+Arrays.asList(p.getNames())+").";
}
All GenericUDF children must implement initialize() and evaluate().
In initialize(), we see very basic argument type checking, initialization of ObjectInspectors for each argument, and declaration of the return type for this UDF.
The accepted primitive type list here could easily be expanded if needed.
BOOLEAN, CHAR, VARCHAR, and possibly TIMESTAMP and DATE might be useful to add.
@Override
public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
// Basic argument count check
// Expects one less argument than model used; results column is dropped
if (args.length != p.getNumCols()) {
throw new UDFArgumentLengthException("Incorrect number of arguments." +
" scoredata() requires: "+ Arrays.asList(p.getNames())
+", in the listed order. Received "+args.length+" arguments.");
}
//Check input types
inFieldOI = new PrimitiveObjectInspector[args.length];
PrimitiveObjectInspector.PrimitiveCategory pCat;
for (int i = 0; i < args.length; i++) {
if (args[i].getCategory() != ObjectInspector.Category.PRIMITIVE)
throw new UDFArgumentException("scoredata(...): Only takes primitive field types as parameters");
pCat = ((PrimitiveObjectInspector) args[i]).getPrimitiveCategory();
if (pCat != PrimitiveObjectInspector.PrimitiveCategory.STRING
& pCat != PrimitiveObjectInspector.PrimitiveCategory.DOUBLE
& pCat != PrimitiveObjectInspector.PrimitiveCategory.FLOAT
& pCat != PrimitiveObjectInspector.PrimitiveCategory.LONG
& pCat != PrimitiveObjectInspector.PrimitiveCategory.INT
& pCat != PrimitiveObjectInspector.PrimitiveCategory.SHORT)
throw new UDFArgumentException("scoredata(...): Cannot accept type: " + pCat.toString());
inFieldOI[i] = (PrimitiveObjectInspector) args[i];
}
// the return type of our function is a double, so we provide the correct object inspector
return PrimitiveObjectInspectorFactory.javaDoubleObjectInspector;
}
The real work is done in the evaluate() method.
Again, some quick sanity checks are made on the arguments, then each argument is converted to a double.
All H2O models take an array of doubles as their input.
For integers, a simple casting is enough.
For strings/enumerations, the double quotes are stripped, then the enumeration value for the given string/field index is retrieved, and then it is cast to a double.
Once all the arguments have been made into doubles, the model's predict() method is called to get a score.
The main prediction for this row is then returned.
@Override
public Object evaluate(DeferredObject[] record) throws HiveException {
// Expects one less argument than model used; results column is dropped
if (record != null) {
if (record.length == p.getNumCols()) {
double[] data = new double[record.length];
//Sadly, HIVE UDF doesn't currently make the field name available.
//Thus this UDF must depend solely on the arguments maintaining the same
// field order seen by the original H2O model creation.
for (int i = 0; i < record.length; i++) {
try {
Object o = inFieldOI[i].getPrimitiveJavaObject(record[i].get());
if (o instanceof java.lang.String) {
// Hive wraps strings in double quotes, remove
data[i] = p.mapEnum(i, ((String) o).replace("\"", ""));
if (data[i] == -1)
throw new UDFArgumentException("scoredata(...): The value " + (String) o
+ " is not a known category for column " + p.getNames()[i]);
} else if (o instanceof Double) {
data[i] = ((Double) o).doubleValue();
} else if (o instanceof Float) {
data[i] = ((Float) o).doubleValue();
} else if (o instanceof Long) {
data[i] = ((Long) o).doubleValue();
} else if (o instanceof Integer) {
data[i] = ((Integer) o).doubleValue();
} else if (o instanceof Short) {
data[i] = ((Short) o).doubleValue();
} else if (o == null) {
return null;
} else {
throw new UDFArgumentException("scoredata(...): Cannot accept type: "
+ o.getClass().toString() + " for argument # " + i + ".");
}
} catch (Throwable e) {
throw new UDFArgumentException("Unexpected exception on argument # " + i + ". " + e.toString());
}
}
// get the predictions
try {
double[] preds = new double[p.getPredsSize()];
p.score0(data, preds);
return preds[0];
} catch (Throwable e) {
throw new UDFArgumentException("H2O predict function threw exception: " + e.toString());
}
} else {
throw new UDFArgumentException("Incorrect number of arguments." +
" scoredata() requires: " + Arrays.asList(p.getNames()) + ", in order. Received "
+ record.length + " arguments.");
}
} else { // record == null
return null; //throw new UDFArgumentException("scoredata() received a NULL row.");
}
}
Really, almost all the work is in type detection and conversion.
> library(h2o)
> h2o.init(nthreads = -1)
> # Download the data into the pums2013 directory if necessary.
> pumsdir <- "pums2013"
> if (! file.exists(pumsdir)) {
> dir.create(pumsdir)
> }
> trainfile <- file.path(pumsdir, "adult_2013_train.csv.gz")
> if (! file.exists(trainfile)) {
> download.file("http://h2o-training.s3.amazonaws.com/pums2013/adult_2013_train.csv.gz", trainfile)
> }
> testfile <- file.path(pumsdir, "adult_2013_test.csv.gz")
> if (! file.exists(testfile)) {
> download.file("http://h2o-training.s3.amazonaws.com/pums2013/adult_2013_test.csv.gz", testfile)
> }
Load the datasets (change the directory to reflect where you stored these files):
> adult_2013_train <- h2o.importFile(trainfile, destination_frame = "adult_2013_train")
> adult_2013_test <- h2o.importFile(testfile, destination_frame = "adult_2013_test")
Looking at the data, we can see that 8 columns are using integer codes to represent different categorical levels.
Let's tell H2O to treat those columns as factors.
> actual_log_wagp <- h2o.assign(adult_2013_test[, "LOG_WAGP"], key = "actual_log_wagp")
> for (j in c("COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX", "POBP")) {
> adult_2013_train[[j]] <- as.factor(adult_2013_train[[j]])
> adult_2013_test[[j]] <- as.factor(adult_2013_test[[j]])
> }
> predset <- c("RELP", "SCHL", "COW", "MAR", "INDP", "RAC1P", "SEX", "POBP", "AGEP", "WKHP", "LOG_CAPGAIN", "LOG_CAPLOSS")
> log_wagp_gbm <- h2o.gbm(x = predset,
y = "LOG_WAGP",
training_frame = adult_2013_train,
model_id = "GBMModel",
distribution = "gaussian",
max_depth = 5,
ntrees = 110,
validation_frame = adult_2013_test)
> log_wagp_gbm
Model Details:
==============
H2ORegressionModel: gbm
Model ID: GBMModel
Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1 110.000000 111698.000000 5.000000 5.000000 5.00000 14.000000 32.000000 27.93636
H2ORegressionMetrics: gbm
** Reported on training data.
**
MSE: 0.4626122
R2 : 0.7362828
Mean Residual Deviance : 0.4626122
H2ORegressionMetrics: gbm
** Reported on validation data.
**
MSE: 0.6605266
R2 : 0.6290677
Mean Residual Deviance : 0.6605266
Next, we download the model as a MOJO into a directory called generated_model.
> tmpdir_name <- "generated_model"
> dir.create(tmpdir_name)
> h2o.download_mojo(log_wagp_gbm, tmpdir_name, get_genmodel_jar = TRUE)
[1] "MOJO written to: generated_model/GBMModel.zip"
At this point, the Java MOJO is available for scoring data outside of H2O.
As the last step in R, let's take a look at the scores this model gives on the test data set.
We will use these to confirm the results in Hive.
> h2o.predict(log_wagp_gbm, adult_2013_test)
H2OFrame with 37345 rows and 1 column
First 10 rows:
predict
1 10.432787
2 10.244159
3 10.432688
4 9.604912
5 10.285979
6 10.356251
7 10.261413
8 10.046026
9 10.766078
10 9.502004
$ cp generated_model/h2o-genmodel.jar localjars
$ mkdir -p src/main/resources/ai/h2o/hive/udf
$ cp generated_model/GBMModel.zip src/main/resources/ai/h2o/hive/udf/GBMModel.zip
$ hadoop version
$ hive --version
Plug these into the <properties> section of the pom.xml file.
Currently, the configuration is set for pulling the necessary dependencies for Hortonworks.
For other Hadoop distributions, you will also need to update the <repositories>
section to reflect the respective repositories (a commented-out link to a Cloudera repository is included).
Caution: This tutorial was written using Maven 3.5.0. Older 2.x versions of Maven may not work.
$ mvn compile
$ mvn package -Dmaven.test.skip=true
As with most Maven builds, the first run will probably seem like it is downloading the entire Internet.
It is just grabbing the needed compile dependencies.
In the end, this process should create the file target/ScoreData-1.0-SNAPSHOT.jar.
As a part of the build process, Maven is running a unit test on the code.
If you are looking to use this template for your own models, you either need to modify the test to reflect your own data, or run Maven without the test (mvn package -Dmaven.test.skip=true).
$ hadoop fs -mkdir hdfs://my-name-node:/user/myhomedir/UDFtest
$ hadoop fs -put adult_2013_test.csv.gz hdfs://my-name-node:/user/myhomedir/UDFtest/.
$ hive
Here we mark the table as EXTERNAL
so that Hive doesn't make a copy of the file needlessly.
We also tell Hive to ignore the first line, since it contains the column names.
> CREATE EXTERNAL TABLE adult_data_set (AGEP INT, COW STRING, SCHL STRING, MAR STRING, INDP STRING, RELP STRING, RAC1P STRING, SEX STRING, WKHP INT, POBP STRING, WAGP INT, CAPGAIN INT, CAPLOSS INT, LOG_CAPGAIN DOUBLE, LOG_CAPLOSS DOUBLE, LOG_WAGP DOUBLE, CENT_WAGP STRING, TOP_WAG2P INT, RELP_SCHL STRING) COMMENT 'PUMS 2013 test data' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE location '/user/myhomedir/UDFtest' tblproperties ("skip.header.line.count"="1");
> ANALYZE TABLE adult_data_set COMPUTE STATISTICS;
$ hadoop fs -put localjars/h2o-genmodel.jar hdfs://my-name-node:/user/myhomedir/
$ hadoop fs -put target/ScoreData-1.0-SNAPSHOT.jar hdfs://my-name-node:/user/myhomedir/
$ hive
Note that for correct class loading, you will need to load the h2o-genmodel.jar before the ScoreData jar file.
> ADD JAR h2o-genmodel.jar;
> ADD JAR ScoreData-1.0-SNAPSHOT.jar;
> CREATE TEMPORARY FUNCTION scoredata AS 'ai.h2o.hive.udf.ScoreDataUDF';
Keep in mind that your UDF is only loaded in Hive for as long as you are using it.
If you quit; and then start Hive again, you will have to re-enter the last three lines.
hive> SELECT scoredata(AGEP, COW, SCHL, MAR, INDP, RELP, RAC1P, SEX, WKHP, POBP, LOG_CAPGAIN, LOG_CAPLOSS) FROM adult_data_set LIMIT 10;
OK
10.476669
10.201586
10.463915
9.709603
10.175115
10.3576145
10.256757
10.050725
10.759903
9.316141
Time taken: 0.063 seconds, Fetched: 10 row(s)
It would be more convenient to write SELECT scoredata(*) FROM adult_data_set; and let the UDF pick out the relevant fields by name.
While the H2O MOJO does have utility functions for this, Hive unfortunately doesn't provide UDF writers the names of the fields from which the arguments originate (as mentioned in this Hive feature request), so the columns must be listed explicitly, in the order the model expects.
Finally, as written, the UDF only returns a single prediction value.
The H2O MOJO actually returns an array of float values.
The first value is the main prediction and the remaining values hold probability distributions for classifiers.
This code can easily be expanded to return all values if desired.
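A hedged sketch of that change (it reuses only methods already shown in this template plus Hive's standard list ObjectInspector; List/ArrayList imports would also be needed):
// In initialize(), declare the return type as a list of doubles:
//   return ObjectInspectorFactory.getStandardListObjectInspector(
//           PrimitiveObjectInspectorFactory.javaDoubleObjectInspector);
// At the end of evaluate(), return all values instead of just preds[0]:
double[] preds = new double[p.getPredsSize()];
p.score0(data, preds);
List<Double> out = new ArrayList<Double>(preds.length);
for (double v : preds) out.add(v);
return out;  // element 0 = main prediction, elements 1..N = class probabilities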
The @UDFType and @Description annotations below provide the usage text shown when a user runs DESCRIBE FUNCTION scoredata or DESCRIBE FUNCTION EXTENDED scoredata.
@UDFType(deterministic = true, stateful = false)
@Description(name="scoredata", value="_FUNC_(*) - Returns a score for the given row",
extended="Example:\n"+"> SELECT scoredata(*) FROM target_data;")
Rather than extend the plain UDF class, this template extends GenericUDF.
The plain UDF requires that you hard code each of your input variables.
This is fine for most UDFs, but for a function like scoring the number of columns used in scoring may be large enough to make this cumbersome.
Note the declaration of an array to hold ObjectInspectors for each argument, as well as the instantiation of the model MOJO.
class ScoreDataUDF extends GenericUDF {
private PrimitiveObjectInspector[] inFieldOI;
MojoModel p;
@Override
public String getDisplayString(String[] args) {
return "scoredata("+Arrays.asList(p.getNames())+").";
}
All GenericUDF children must implement initialize() and evaluate().
In initialize(), we see very basic argument type checking, initialization of ObjectInspectors for each argument, and declaration of the return type for this UDF.
The accepted primitive type list here could easily be expanded if needed.
BOOLEAN, CHAR, VARCHAR, and possibly TIMESTAMP and DATE might be useful to add.
@Override
public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
// Get the MOJO as a resource
URL mojoURL = ScoreDataUDF.class.getResource("GBMModel.zip");
// Declare r as a MojoReaderBackend
MojoReaderBackend r;
// Read the MOJO and assign it to p
try {
r = MojoReaderBackendFactory.createReaderBackend(mojoURL, CachingStrategy.MEMORY);
p = ModelMojoReader.readFrom(r);
} catch (IOException e) {
throw new RuntimeException(e);
}
// Basic argument count check
// Expects one less argument than model used; results column is dropped
if (args.length != p.getNumCols()) {
throw new UDFArgumentLengthException("Incorrect number of arguments." +
" scoredata() requires: " + Arrays.asList(p.getNames()) +
", in the listed order. Received " + args.length + " arguments.");
}
//Check input types
inFieldOI = new PrimitiveObjectInspector[args.length];
PrimitiveObjectInspector.PrimitiveCategory pCat;
for (int i = 0; i < args.length; i++) {
if (args[i].getCategory() != ObjectInspector.Category.PRIMITIVE)
throw new UDFArgumentException("scoredata(...): Only takes primitive field types as parameters");
pCat = ((PrimitiveObjectInspector) args[i]).getPrimitiveCategory();
if (pCat != PrimitiveObjectInspector.PrimitiveCategory.STRING
&& pCat != PrimitiveObjectInspector.PrimitiveCategory.DOUBLE
&& pCat != PrimitiveObjectInspector.PrimitiveCategory.FLOAT
&& pCat != PrimitiveObjectInspector.PrimitiveCategory.LONG
&& pCat != PrimitiveObjectInspector.PrimitiveCategory.INT
&& pCat != PrimitiveObjectInspector.PrimitiveCategory.SHORT)
throw new UDFArgumentException("scoredata(...): Cannot accept type: " + pCat.toString());
inFieldOI[i] = (PrimitiveObjectInspector) args[i];
}
// the return type of our function is a double, so we provide the correct object inspector
return PrimitiveObjectInspectorFactory.javaDoubleObjectInspector;
}
The real work is done in the evaluate() method.
Again, some quick sanity checks are made on the arguments, then each argument is converted to a double.
All H2O models take an array of doubles as their input.
For integers, a simple casting is enough.
For strings/enumerations, the double quotes are stripped, then the enumeration value for the given string/field index is retrieved, and then it is cast to a double.
Once all the arguments have been made into doubles, the model's predict() method is called to get a score.
The main prediction for this row is then returned.
@Override
public Object evaluate(DeferredObject[] record) throws HiveException {
// Expects one less argument than model used; results column is dropped
if (record != null) {
if (record.length == p.getNumCols()) {
double[] data = new double[record.length];
//Sadly, HIVE UDF doesn't currently make the field name available.
//Thus this UDF must depend solely on the arguments maintaining the same
// field order seen by the original H2O model creation.
for (int i = 0; i < record.length; i++) {
try {
Object o = inFieldOI[i].getPrimitiveJavaObject(record[i].get());
if (o instanceof java.lang.String) {
// Hive wraps strings in double quotes, remove
data[i] = p.mapEnum(i, ((String) o).replace("\"", ""));
if (data[i] == -1)
throw new UDFArgumentException("scoredata(...): The value " + (String) o
+ " is not a known category for column " + p.getNames()[i]);
} else if (o instanceof Double) {
data[i] = ((Double) o).doubleValue();
} else if (o instanceof Float) {
data[i] = ((Float) o).doubleValue();
} else if (o instanceof Long) {
data[i] = ((Long) o).doubleValue();
} else if (o instanceof Integer) {
data[i] = ((Integer) o).doubleValue();
} else if (o instanceof Short) {
data[i] = ((Short) o).doubleValue();
} else if (o == null) {
return null;
} else {
throw new UDFArgumentException("scoredata(...): Cannot accept type: "
+ o.getClass().toString() + " for argument # " + i + ".");
}
} catch (Throwable e) {
throw new UDFArgumentException("Unexpected exception on argument # " + i + ".
" + e.toString());
}
}
// get the predictions
try {
double[] preds = new double[p.getPredsSize()];
p.score0(data, preds);
return preds[0];
} catch (Throwable e) {
throw new UDFArgumentException("H2O predict function threw exception: " + e.toString());
}
} else {
throw new UDFArgumentException("Incorrect number of arguments." +
" scoredata() requires: " + Arrays.asList(p.getNames()) + ", in order.
Received "
+record.length+" arguments.");
}
} else { // record == null
return null; //throw new UDFArgumentException("scoredata() received a NULL row.");
}
}
Really, almost all the work is in type detection and conversion.
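As noted earlier, the accepted type list can be extended. A hedged sketch (not part of the template as written) of also accepting BOOLEAN columns:
// In initialize(), add BOOLEAN to the accepted categories, e.g.:
//   ... && pCat != PrimitiveObjectInspector.PrimitiveCategory.BOOLEAN)
// In evaluate(), add a branch mapping the boxed Boolean onto the 0/1 encoding the model expects:
} else if (o instanceof Boolean) {
  data[i] = ((Boolean) o) ? 1.0 : 0.0;
}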
This tutorial shows how to build a Stacked Ensemble in R with the h2o.stackedEnsemble function.
This demo uses a subset of the HIGGS dataset, which has 28 numeric features and a binary response.
The machine learning task in this example is to distinguish between a signal process which produces Higgs bosons (Y = 1) and a background process which does not (Y = 0).
The dataset contains approximately the same number of positive vs negative examples.
In other words, this is a balanced, rather than imbalanced, dataset.
To run this script, be sure to setwd() to the location of this script. h2o.init() starts H2O in R's current working directory, and h2o.importFile() looks for files from the perspective of where H2O was started.
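For example (the path below is just a placeholder; point it at your local copy of the script):
setwd("/path/to/this/script")   # placeholder path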
library(h2o)
h2o.init()
# Import a sample binary outcome train/test set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
Identify predictors and response:
y <- "response"
x <- setdiff(names(train), y)
For binary classification, the response should be encoded as a factor (known as the enum type in Java or categorical in Python pandas).
You can specify column types in the h2o.importFile command, or convert the response column afterwards as follows:
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])
Number of CV folds (to generate level-one data for stacking):
nfolds <- 5
Note that we set keep_cross_validation_predictions = TRUE when training the base models below, because the Stacked Ensemble algorithm requires the cross-validation predictions of the base learners to train the metalearner (unless you use a blending frame).
# Train & cross-validate a GBM:
my_gbm <- h2o.gbm(x = x,
y = y,
training_frame = train,
distribution = "bernoulli",
ntrees = 10,
max_depth = 3,
min_rows = 2,
learn_rate = 0.2,
nfolds = nfolds,
keep_cross_validation_predictions = TRUE,
seed = 1)
# Train & cross-validate a RF:
my_rf <- h2o.randomForest(x = x,
y = y,
training_frame = train,
ntrees = 50,
nfolds = nfolds,
keep_cross_validation_predictions = TRUE,
seed = 1)
# Train a stacked ensemble using the GBM and RF above:
ensemble <- h2o.stackedEnsemble(x = x,
y = y,
training_frame = train,
base_models = list(my_gbm, my_rf))
perf <- h2o.performance(ensemble, newdata = test)
ensemble_auc_test <- h2o.auc(perf)
perf_gbm_test <- h2o.performance(my_gbm, newdata = test)
perf_rf_test <- h2o.performance(my_rf, newdata = test)
baselearner_best_auc_test <- max(h2o.auc(perf_gbm_test),
h2o.auc(perf_rf_test))
print(sprintf("Best Base-learner Test AUC: %s", baselearner_best_auc_test))
print(sprintf("Ensemble Test AUC: %s", ensemble_auc_test))
# [1] "Best Base-learner Test AUC: 0.76979821502548"
# [1] "Ensemble Test AUC: 0.773501212640419"
So we see that the ensemble outperforms the best individual algorithm in this group: the ensemble's test set AUC is 0.7735, versus 0.7698 for the GBM, the best base learner.
At first glance this might not seem like much, but in many industries, such as medicine or finance, this small advantage can be highly valuable.
To increase the performance of the ensemble, we have several options.
One of them is to increase the number of cross-validation folds using the nfolds argument.
Others are to change the set of base learners or the metalearning algorithm, as sketched below.
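As a sketch of that last option (this assumes a recent H2O release in which h2o.stackedEnsemble exposes the metalearner_algorithm argument; the default metalearner is a GLM):
# Re-train the ensemble with a GBM metalearner instead of the default GLM
ensemble_gbm_meta <- h2o.stackedEnsemble(x = x,
                                         y = y,
                                         training_frame = train,
                                         base_models = list(my_gbm, my_rf),
                                         metalearner_algorithm = "gbm")
h2o.auc(h2o.performance(ensemble_gbm_meta, newdata = test))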
pred <- h2o.predict(ensemble, newdata = test)
# GBM Hyperparameters
learn_rate_opt <- c(0.01, 0.03)
max_depth_opt <- c(3, 4, 5, 6, 9)
sample_rate_opt <- c(0.7, 0.8, 0.9, 1.0)
col_sample_rate_opt <- c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
hyper_params <- list(learn_rate = learn_rate_opt,
max_depth = max_depth_opt,
sample_rate = sample_rate_opt,
col_sample_rate = col_sample_rate_opt)
search_criteria <- list(strategy = "RandomDiscrete",
max_models = 3,
seed = 1)
gbm_grid <- h2o.grid(algorithm = "gbm",
grid_id = "gbm_grid_binomial",
x = x,
y = y,
training_frame = train,
ntrees = 10,
seed = 1,
nfolds = nfolds,
keep_cross_validation_predictions = TRUE,
hyper_params = hyper_params,
search_criteria = search_criteria)
# Train a stacked ensemble using the GBM grid
ensemble <- h2o.stackedEnsemble(x = x,
y = y,
training_frame = train,
base_models = gbm_grid@model_ids)
# Eval ensemble performance on a test set
perf <- h2o.performance(ensemble, newdata = test)
# Compare to base learner performance on the test set
.getauc <- function(mm) h2o.auc(h2o.performance(h2o.getModel(mm), newdata = test))
baselearner_aucs <- sapply(gbm_grid@model_ids, .getauc)
baselearner_best_auc_test <- max(baselearner_aucs)
ensemble_auc_test <- h2o.auc(perf)
print(sprintf("Best Base-learner Test AUC: %s", baselearner_best_auc_test))
print(sprintf("Ensemble Test AUC: %s", ensemble_auc_test))
# [1] "Best Base-learner Test AUC: 0.748146530400473"
# [1] "Ensemble Test AUC: 0.773501212640419"
# Generate predictions on a test set (if necessary)
pred <- h2o.predict(ensemble, newdata = test)
h2o.shutdown()
git clone https://github.com/apache/storm.git
git clone https://github.com/h2oai/h2o-world-2015-training.git
NOTE: Building storm (c.f.
Section 5) requires Maven.
You can install Maven (version 3.x) by following the Maven installation instructions.
Navigate to the directory for this tutorial inside the h2o-world-2015-training repository:
cd h2o-world-2015-training/tutorials/streaming/storm
You should see the following files in this directory:
README.md (This document)
example.R (The R script that builds the GBM model and exports it as a Java POJO)
training_data.csv (The data used to build the GBM model)
live_data.csv (The data that predictions are made on; used to feed the spout in the Storm topology)
H2OStormStarter.java (The Storm topology with two bolts: a prediction bolt and a classifying bolt)
TestH2ODataSpout.java (The Storm spout which reads data from the live_data.csv file and passes each observation to the prediction bolt one observation at a time; this simulates the arrival of data in real-time)
And the following directories:
premade_generated_model (For those people who have trouble building the model but want to try running with Storm anyway; you can ignore this directory if you successfully build your own generated_model later in the tutorial)
images (Images for the tutorial documentation, you can ignore these)
web (Contains the html and image files for watching the real-time prediction output (c.f.
Section 8))
Note: The H2O package for R includes both the R code and the main H2O jar file. This is all you need to run H2O locally on your laptop.
Step 1: Start R (at the command line or via RStudio)
Step 2: Install H2O from CRAN
install.packages("h2o")
Note: For convenience, this tutorial was created with the Slater stable release of H2O (3.2.0.3) from CRAN, as shown above.
Later versions of H2O will also work.
To watch the real-time predictions in your browser (Section 8), you will also need:
npm (brew install npm)
http-server (npm install http-server -g)
A modern web browser (animations depend on D3)
head -n 20 training_data.csv
Label | Has4Legs | CoatColor | HairLength | TailLength | EnjoysPlay | StaresOutWindow | HoursSpentNapping | RespondsToCommands | EasilyFrightened | Age | Noise1 | Noise2 | Noise3 | Noise4 | Noise5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dog | 1 | Brown | 0 | 2 | 1 | 1 | 2 | 1 | 0 | 4 | 0.852352328598499 | 0.229839221341535 | 0.576096264412627 | 0.0105558061040938 | 0.470826978096738 |
dog | 1 | Brown | 1 | 1 | 1 | 1 | 5 | 0 | 0 | 16 | 0.928460991941392 | 0.98618565662764 | 0.553872474469244 | 0.932764369761571 | 0.435074317501858 |
dog | 1 | Grey | 1 | 10 | 1 | 1 | 2 | 1 | 0 | 5 | 0.658247262472287 | 0.379703616956249 | 0.767817151267081 | 0.840509128058329 | 0.538852979661897 |
dog | 1 | Grey | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 0.210346511797979 | 0.912498287158087 | 0.757371880114079 | 0.915149037493393 | 0.27393517526798 |
dog | 1 | Brown | 1 | 5 | 1 | 0 | 10 | 1 | 0 | 20 | 0.770219849422574 | 0.999768516747281 | 0.482816896401346 | 0.904691722244024 | 0.232283475110307 |
cat | 1 | Grey | 1 | 6 | 1 | 1 | 3 | 0 | 1 | 10 | 0.499049366684631 | 0.690937616396695 | 0.00580681697465479 | 0.516113663092256 | 0.161103375256062 |
dog | 1 | Spotted | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 17 | 0.980622073402628 | 0.193929805886 | 0.50500241224654 | 0.848579460754991 | 0.750856031663716 |
cat | 1 | Spotted | 1 | 7 | 0 | 1 | 5 | 0 | 1 | 9 | 0.298585452139378 | 0.425832540960982 | 0.816698056645691 | 0.0246927759144455 | 0.692579888971522 |
dog | 1 | Grey | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 3 | 0.724013194208965 | 0.120883409865201 | 0.754467910155654 | 0.43663241318427 | 0.0592612794134766 |
cat | 1 | Black | 0 | 7 | 0 | 1 | 5 | 0 | 1 | 5 | 0.849093642551452 | 0.0961945767048746 | 0.588080670218915 | 0.0478771082125604 | 0.211781785823405 |
dog | 1 | Grey | 1 | 1 | 1 | 0 | 2 | 0 | 1 | 1 | 0.362678906414658 | 0.54775956296362 | 0.522148486692458 | 0.903857592027634 | 0.496479033492506 |
dog | 1 | Spotted | 0 | 1 | 1 | 1 | 2 | 1 | 0 | 3 | 0.745238043367863 | 0.0181446429342031 | 0.33444849960506 | 0.550831729080528 | 0.625747208483517 |
dog | 1 | Spotted | 1 | 4 | 1 | 1 | 2 | 1 | 0 | 20 | 0.693285189568996 | 0.69526576064527 | 0.386858200887218 | 0.235119538847357 | 0.401590927504003 |
cat | 1 | Spotted | 1 | 8 | 1 | 1 | 3 | 0 | 0 | 3 | 0.695167713565752 | 0.81692309374921 | 0.530564708868042 | 0.081766308285296 | 0.277844901895151 |
cat | 1 | White | 1 | 8 | 0 | 1 | 5 | 0 | 0 | 3 | 0.0237249641213566 | 0.867370987776667 | 0.855278167175129 | 0.284646768355742 | 0.566314383875579 |
cat | 1 | Black | 1 | 5 | 1 | 1 | 2 | 0 | 1 | 16 | 0.281967194052413 | 0.798100406536832 | 0.306403951486573 | 0.681048742029816 | 0.237810888560489 |
cat | 1 | Grey | 1 | 7 | 1 | 1 | 3 | 1 | 1 | 16 | 0.178538456792012 | 0.566589535912499 | 0.297640548087656 | 0.634627313353121 | 0.677242929581553 |
cat | 1 | Spotted | 1 | 8 | 0 | 0 | 10 | 0 | 1 | 3 | 0.219212393043563 | 0.482533045113087 | 0.739678716054186 | 0.132942436495796 | 0.100684949662536 |
The example.R script builds the model and exports the Java POJO to the generated_model temporary directory. Run example.R at the command line as follows:
R -f example.R
You will see the following output:
R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> #
> # Example R code for generating an H2O Scoring POJO.
> #
>
> # "Safe" system.  Error checks process exit status code.  stop() if it failed.
> safeSystem <- function(x) {
+   print(sprintf("+ CMD: %s", x))
+   res <- system(x)
+   print(res)
+   if (res != 0) {
+     msg <- sprintf("SYSTEM COMMAND FAILED (exit status %d)", res)
+     stop(msg)
+   }
+ }
> library(h2o)
Loading required package: statmod
----------------------------------------------------------------------
Your next step is to start H2O:
> h2o.init()
For H2O package documentation, ask for help:
> ??h2o
After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai
----------------------------------------------------------------------
Attaching package: ‘h2o’
The following objects are masked from ‘package:stats’:
sd, var
The following objects are masked from ‘package:base’:
%*%, %in%, apply, as.factor, as.numeric, colnames, colnames<-, ifelse, is.factor, is.numeric, log, range, trunc
> cat("Starting H2O\n")
Starting H2O
> myIP <- "localhost"
> myPort <- 54321
> h <- h2o.init(ip = myIP, port = myPort, startH2O = TRUE)
H2O is not running yet, starting it now...
Note:  In case of errors look at the following log files:
    /var/folders/ct/mv0lk53d5lq6bkvm_2snjgm00000gn/T/RtmpkEUbAR/h2o_ludirehak_started_from_r.out
    /var/folders/ct/mv0lk53d5lq6bkvm_2snjgm00000gn/T/RtmpkEUbAR/h2o_ludirehak_started_from_r.err

java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)

..Successfully connected to http://localhost:54321

R is connected to H2O cluster:
    H2O cluster uptime:        1 seconds 738 milliseconds
    H2O cluster version:       3.3.0.99999
    H2O cluster name:          H2O_started_from_R_ludirehak_dwh703
    H2O cluster total nodes:   1
    H2O cluster total memory:  3.56 GB
    H2O cluster total cores:   8
    H2O cluster allowed cores: 2
    H2O cluster healthy:       TRUE

Note:  As started, H2O is limited to the CRAN default of 2 CPUs.
       Shut down and restart H2O as shown below to use all your CPUs.

> h2o.shutdown()
> h2o.init(nthreads = -1)
>
> cat("Building GBM model\n")
Building GBM model
> df <- h2o.importFile(path = normalizePath("./training_data.csv"))
  |======================================================================| 100%
> y <- "Label"
> x <- c("Has4Legs", "CoatColor", "HairLength", "TailLength", "EnjoysPlay", "StaresOutWindow", "HoursSpentNapping", "RespondsToCommands", "EasilyFrightened", "Age", "Noise1", "Noise2", "Noise3", "Noise4", "Noise5")
> gbm.h2o.fit <- h2o.gbm(training_frame = df, y = y, x = x, model_id = "GBMPojo", ntrees = 10)
  |======================================================================| 100%
> cat("Downloading Java prediction model code from H2O\n")
Downloading Java prediction model code from H2O
> model_id <- gbm.h2o.fit@model_id
>
> tmpdir_name <- "generated_model"
> cmd <- sprintf("rm -fr %s", tmpdir_name)
> safeSystem(cmd)
[1] "+ CMD: rm -fr generated_model"
[1] 0
> cmd <- sprintf("mkdir="" %s",="" tmpdir_name)=""> safeSystem(cmd)
[1] "+ CMD: mkdir generated_model"
[1] 0
>
> h2o.download_pojo(gbm.h2o.fit, "./generated_model/")
[1] "POJO written to: ./generated_model//GBMPojo.java"
>
> cat("Note: H2O will shut down automatically if it was started by this R script and the script exits\n")
Note: H2O will shut down automatically if it was started by this R script and the script exits
>
ls -l generated_model/
total 72
-rw-r--r-- 1 ludirehak staff 19764 Sep 25 12:36 GBMPojo.java
-rw-r--r-- 1 ludirehak staff 23655 Sep 25 12:36 h2o-genmodel.jar
The h2o-genmodel.jar file contains the interface definition, and the GBMPojo.java file contains the Java code for the POJO model.
The following three sections from the generated model are of special importance.
public class GBMPojo extends GenModel {
This is the class to instantiate in the Storm bolt to make predictions.
public final double[] score0( double[] data, double[] preds )
score0() is the method to call to make a single prediction for a new observation.
data is the input, and preds is the output.
The return value is just preds, and can be ignored.
Inputs and Outputs must be numerical.
Categorical columns must be translated into numerical values using the DOMAINS mapping on the way in.
Even if the response is categorical, the result will be numerical.
It can be mapped back to a level string using DOMAINS, if desired.
When the response is categorical, the preds response is structured as follows:
preds[0] contains the predicted level number
preds[1] contains the probability that the observation is level0
preds[2] contains the probability that the observation is level1
...
preds[N] contains the probability that the observation is levelN-1
sum(preds[1] ... preds[N]) == 1.0
In this specific case, that means:
preds[0] contains 0 or 1
preds[1] contains the probability that the observation is ColInfo_15.VALUES[0]
preds[2] contains the probability that the observation is ColInfo_15.VALUES[1]
// Column domains.
The last array contains domain of response column.
public static final String[][] DOMAINS = new String[][] {
/* Has4Legs */ null,
/* CoatColor */ GBMPojo_ColInfo_1.VALUES,
/* HairLength */ null,
/* TailLength */ null,
/* EnjoysPlay */ null,
/* StaresOutWindow */ null,
/* HoursSpentNapping */ null,
/* RespondsToCommands */ null,
/* EasilyFrightened */ null,
/* Age */ null,
/* Noise1 */ null,
/* Noise2 */ null,
/* Noise3 */ null,
/* Noise4 */ null,
/* Noise5 */ null,
/* Label */ GBMPojo_ColInfo_15.VALUES
};
The DOMAINS array contains information about the level names of categorical columns.
Note that Label (the column we are predicting) is the last entry in the DOMAINS array.
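As a small illustration (a sketch, not part of the tutorial code; it assumes a double[] data row already encoded via mapEnum, as done in the Storm bolt later), the prediction can be mapped back to its label through DOMAINS:
GBMPojo p = new GBMPojo();
double[] preds = new double[GBMPojo.NCLASSES + 1];
p.score0(data, preds);                          // data: numeric row, categoricals encoded with p.mapEnum(...)
int level = (int) preds[0];                     // predicted level number (0 or 1)
String label = GBMPojo.DOMAINS[15][level];      // "cat" or "dog" -- Label is the last DOMAINS entry
double probLevel0 = preds[1];                   // P(label == DOMAINS[15][0])
double probLevel1 = preds[2];                   // P(label == DOMAINS[15][1])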
cd storm && mvn clean install -DskipTests=true
Once storm is built, open up your favorite IDE to start building the h2o streaming topology.
In this tutorial, we will be using IntelliJ.
To import the storm-starter project into your IntelliJ please follow these screenshots:
Click on "Import Project" and find the storm repo.
Select storm-starter and click "OK".
Once the project is imported, edit the output file path in H2OStormStarter.java so that it points to: PATH_TO_H2O_WORLD_2015_TRAINING/h2o-world-2015-training/tutorials/streaming/storm/web/out
Likewise, edit L46 of TestH2ODataSpout.java so that the file path is: PATH_TO_H2O_WORLD_2015_TRAINING/h2o-world-2015-training/tutorials/streaming/storm/live_data.csv
Now copy both files into the storm-starter project:
cp H2OStormStarter.java /PATH_TO_STORM/storm/examples/storm-starter/test/jvm/storm/starter/
cp TestH2ODataSpout.java /PATH_TO_STORM/storm/examples/storm-starter/test/jvm/storm/starter/
Your project should now look like this:
The generated GBMPojo.java needs to be part of the storm.starter package, so add package storm.starter; as its first line. One way to do this is with sed:
sed -i -e '1i\'$'\n''package storm.starter;'$'\n' ./generated_model/GBMPojo.java
We now copy over the POJO from section 4 into our storm project.
cp ./generated_model/GBMPojo.java /PATH_TO_STORM/storm/examples/storm-starter/test/jvm/storm/starter/
OR if you were not able to build the GBMPojo, copy over the pre-made version:
cp ./premade_generated_model/GBMPojo.java /PATH_TO_STORM/storm/examples/storm-starter/test/jvm/storm/starter/
If copying over the pre-made version of GBMPojo, also repeat the above steps in this section to import the pre-made h2o-genmodel.jar from the premade_generated_model directory.
Your storm-starter project directory should now look like this:
@Override public void execute(Tuple tuple) {
GBMPojo p = new GBMPojo();
// get the input tuple as a String[]
ArrayList<String> vals_string = new ArrayList<String>();
for (Object v : tuple.getValues()) vals_string.add((String)v);
String[] raw_data = vals_string.toArray(new String[vals_string.size()]);
// the score pojo requires a single double[] of input.
// We handle all of the categorical mapping ourselves
double data[] = new double[raw_data.length-1]; //drop the Label
String[] colnames = tuple.getFields().toList().toArray(new String[tuple.size()]);
// if the column is a factor column, then look up the value, otherwise put the double
for (int i = 1; i < raw_data.length; ++i) {
data[i-1] = p.getDomainValues(colnames[i]) == null
? Double.valueOf(raw_data[i])
: p.mapEnum(p.getColIdx(colnames[i]), raw_data[i]);
}
// get the predictions
double[] preds = new double [GBMPojo.NCLASSES+1];
//p.predict(data, preds);
p.score0(data, preds);
// emit the results
_collector.emit(tuple, new Values(raw_data[0], preds[1]));
_collector.ack(tuple);
}
The probability emitted is the probability of being a 'dog'.
We use this probability to decide whether the observation is of type 'cat' or 'dog' depending on some threshold.
This threshold was chosen such that the F1 score was maximized for the testing data (please see AUC and/or h2o.performance() from R).
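A hedged R sketch of finding such a threshold (test_frame here is an assumed held-out H2OFrame; example.R itself does not create one):
perf <- h2o.performance(gbm.h2o.fit, newdata = test_frame)
h2o.find_threshold_by_max_metric(perf, "f1")   # threshold that maximizes F1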
The ClassifierBolt then looks like:
public static class ClassifierBolt extends BaseRichBolt {
OutputCollector _collector;
final double _thresh = 0.54;
@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
_collector = collector;
}
@Override
public void execute(Tuple tuple) {
String expected=tuple.getString(0);
double dogProb = tuple.getDouble(1);
String content = expected + "," + (dogProb <= _thresh ? "dog" : "cat");
try {
File file = new File("/Users/ludirehak/other_h2o/h2o-world-2015-training/tutorials/streaming/storm/web/out");
if (!file.exists()) file.createNewFile();
FileWriter fw = new FileWriter(file.getAbsoluteFile());
BufferedWriter bw = new BufferedWriter(fw);
bw.write(content);
bw.close();
} catch (IOException e) {
e.printStackTrace();
}
_collector.emit(tuple, new Values(expected, dogProb <= _thresh ? "dog" : "cat"));
_collector.ack(tuple);
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("expected_class", "class"));
}
}
brew install npm
npm install http-server -g
Once these are installed, you may navigate to the web directory and start the server:
cd web
http-server -p 4040 -c-1
Now open up your browser and navigate to http://localhost:4040.
Requires a modern browser (depends on D3 for animation).
Here's a short video showing what it looks like all together.
Enjoy!
cd "path/to/sparkling/water"
export SPARK_HOME="/path/to/spark/installation"
export MASTER="local-cluster[3,2,4096]"
bin/sparkling-shell --conf spark.executor.memory=2G
Open Spark UI: Go to http://localhost:4040/ to see the Spark status.
Note: To avoid flooding the output with Spark INFO messages, I recommend editing your $SPARK_HOME/conf/log4j.properties and configuring the log level to WARN.
Prepare the environment:
// Input data
val DATAFILE="../data/smsData.txt"
// Common imports from H2O and Spark
import _root_.hex.deeplearning.{DeepLearningModel, DeepLearning}
import _root_.hex.deeplearning.DeepLearningParameters
import org.apache.spark.examples.h2o.DemoUtils._
import org.apache.spark.h2o._
import org.apache.spark.mllib
import org.apache.spark.mllib.feature.{IDFModel, IDF, HashingTF}
import org.apache.spark.rdd.RDD
import water.Key
Define the representation of the training message:
// Representation of a training message
case class SMS(target: String, fv: mllib.linalg.Vector)
Define the data loader and parser:
def load(dataFile: String): RDD[Array[String]] = {
// Load file into memory, split on TABs and filter all empty lines
sc.textFile(dataFile).map(l => l.split("\t")).filter(r => !r(0).isEmpty)
}
Define the input messages tokenizer:
// Tokenizer
// For each sentence in input RDD it provides array of string representing individual interesting words in the sentence
def tokenize(dataRDD: RDD[String]): RDD[Seq[String]] = {
// Ignore all useless words
val ignoredWords = Seq("the", "a", "", "in", "on", "at", "as", "not", "for")
// Ignore all useless characters
val ignoredChars = Seq(',', ':', ';', '/', '<', '>', '"', '.', '(', ')', '?', '-', '\'','!','0', '1')
// Invoke RDD API and transform input data
val textsRDD = dataRDD.map( r => {
// Get rid of all useless characters
var smsText = r.toLowerCase
for( c <- ignoredChars) {
smsText = smsText.replace(c, ' ')
}
// Remove empty and uninteresting words
val words = smsText.split(" ").filter(w => !ignoredWords.contains(w) && w.length>2).distinct
words.toSeq
})
textsRDD
}
Configure Spark's Tf-IDF model builder:
def buildIDFModel(tokensRDD: RDD[Seq[String]],
minDocFreq:Int = 4,
hashSpaceSize:Int = 1 << 10): (HashingTF, IDFModel, RDD[mllib.linalg.Vector]) = {
// Hash strings into the given space
val hashingTF = new HashingTF(hashSpaceSize)
val tf = hashingTF.transform(tokensRDD)
// Build term frequency-inverse document frequency model
val idfModel = new IDF(minDocFreq = minDocFreq).fit(tf)
val expandedTextRDD = idfModel.transform(tf)
(hashingTF, idfModel, expandedTextRDD)
}
Wikipedia defines TF-IDF as: "tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general."
Configure H2O's DeepLearning model builder:
def buildDLModel(trainHF: Frame, validHF: Frame,
epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
hidden: Array[Int] = Array[Int](200, 200))
(implicit h2oContext: H2OContext): DeepLearningModel = {
import h2oContext._
import _root_.hex.deeplearning.DeepLearning
import _root_.hex.deeplearning.DeepLearningParameters
// Create algorithm parameters
val dlParams = new DeepLearningParameters()
// Name for target model
dlParams._model_id = Key.make("dlModel.hex")
// Training dataset
dlParams._train = trainHF
// Validation dataset
dlParams._valid = validHF
// Column used as target for training
dlParams._response_column = 'target
// Number of passes over data
dlParams._epochs = epochs
// L1 penalty
dlParams._l1 = l1
// Number and size of internal hidden layers
dlParams._hidden = hidden
// Create a DeepLearning job
val dl = new DeepLearning(dlParams)
// And launch it
val dlModel = dl.trainModel.get
// Force computation of model metrics on both datasets
dlModel.score(trainHF).delete()
dlModel.score(validHF).delete()
// And return resulting model
dlModel
}
Initialize H2OContext
and start H2O services on top of Spark:
// Create SQL support
import org.apache.spark.sql._
implicit val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._
// Start H2O services
import org.apache.spark.h2o._
val h2oContext = new H2OContext(sc).start()
Open H2O UI and verify that H2O is running:
h2oContext.openFlow
At this point, you can use the H2O UI and see the status of the H2O cloud by typing getCloud.
Build the final workflow using all building pieces:
// Data load
val dataRDD = load(DATAFILE)
// Extract response column from dataset
val hamSpamRDD = dataRDD.map( r => r(0))
// Extract message from dataset
val messageRDD = dataRDD.map( r => r(1))
// Tokenize message content
val tokensRDD = tokenize(messageRDD)
// Build IDF model on tokenized messages
// It returns
// - hashingTF: hashing function to hash a word to a vector space
// - idfModel: a model to transform hashed sentence to a feature vector
// - tfidf: transformed input messages
var (hashingTF, idfModel, tfidfRDD) = buildIDFModel(tokensRDD)
// Merge response with extracted vectors
val resultDF = hamSpamRDD.zip(tfidfRDD).map(v => SMS(v._1, v._2)).toDF
// Publish Spark DataFrame as H2OFrame
val tableHF = h2oContext.asH2OFrame(resultDF, "messages_table")
// Transform target column into categorical!
tableHF.replace(tableHF.find("target"), tableHF.vec("target").toCategoricalVec()).remove()
tableHF.update(null)
// Split table into training and validation parts
val keys = Array[String]("train.hex", "valid.hex")
val ratios = Array[Double](0.8)
val frs = split(tableHF, keys, ratios)
val (trainHF, validHF) = (frs(0), frs(1))
tableHF.delete()
// Build final DeepLearning model
val dlModel = buildDLModel(trainHF, validHF)(h2oContext)
Evaluate the model's quality:
// Collect model metrics and evaluate model quality
import water.app.ModelMetricsSupport
val trainMetrics = ModelMetricsSupport.binomialMM(dlModel, trainHF)
val validMetrics = ModelMetricsSupport.binomialMM(dlModel, validHF)
println(trainMetrics.auc._auc)
println(validMetrics.auc._auc)
You can also open the H2O UI and type getPredictions to visualize the model's performance, or type getModels to see model output.
Create a spam detector:
// Spam detector
def isSpam(msg: String,
dlModel: DeepLearningModel,
hashingTF: HashingTF,
idfModel: IDFModel,
h2oContext: H2OContext,
hamThreshold: Double = 0.5):String = {
val msgRdd = sc.parallelize(Seq(msg))
val msgVector: DataFrame = idfModel.transform(
hashingTF.transform (
tokenize (msgRdd))).map(v => SMS("?", v)).toDF
val msgTable: H2OFrame = h2oContext.asH2OFrame(msgVector)
msgTable.remove(0) // remove first column
val prediction = dlModel.score(msgTable)
if (prediction.vecs()(1).at(0) < hamThreshold) "SPAM DETECTED!" else "HAM"
}
Try to detect spam:
isSpam("Michal, h2oworld party tonight in MV?", dlModel, hashingTF, idfModel, h2oContext)
//
isSpam("We tried to contact you re your reply to our offer of a Video Handset? 750 anytime any networks mins? UNLIMITED TEXT?", dlModel, hashingTF, idfModel, h2oContext)
At this point, you have finished your first Sparkling Water machine learning application.
Hack and enjoy! Thank you!
sc
<pyspark.context.SparkContext at 0x102cea1d0>
from pysparkling import *
sc
hc = H2OContext(sc).start()
Warning: Version mismatch.
H2O is version 3.6.0.2, but the python package is version 3.7.0.99999.
H2O cluster uptime: | 2 seconds 217 milliseconds |
H2O cluster version: | 3.6.0.2 |
H2O cluster name: | sparkling-water-nidhimehta |
H2O cluster total nodes: | 2 |
H2O cluster total memory: | 3.83 GB |
H2O cluster total cores: | 16 |
H2O cluster allowed cores: | 16 |
H2O cluster healthy: | True |
H2O Connection ip: | 172.16.2.98 |
H2O Connection port: | 54329 |
hc
H2OContext: ip=172.16.2.98, port=54329
import h2o
#dir(h2o)
column_type = ['Numeric','String','String','Enum','Enum','Enum','Enum','Enum','Enum','Enum','Numeric','Numeric','Numeric','Numeric','Enum','Numeric','Numeric','Numeric','Enum','Numeric','Numeric','Enum']
f_crimes = h2o.import_file(path ="../data/chicagoCrimes10k.csv",col_types =column_type)
print(f_crimes.shape)
f_crimes.summary()
Parse Progress: [##################################################] 100%
(9999, 22)
ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | Beat | District | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
type | int | string | string | enum | enum | enum | enum | enum | enum | enum | int | int | int | int | enum | int | int | int | enum | real | real | enum |
mins | 21735.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 111.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1100317.0 | 1814255.0 | 2015.0 | 0.0 | 41.64507243 | -87.906463888 | 0.0 |
mean | 9931318.73737 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.292829282928 | 0.152315231523 | 1159.61806181 | 11.3489885128 | 22.9540954095 | 37.4476447645 | NaN | 1163880.59815 | 1885916.14984 | 2015.0 | NaN | 41.8425652247 | -87.6741405221 | NaN |
maxs | 9962898.0 | NaN | NaN | 6517.0 | 212.0 | 26.0 | 198.0 | 90.0 | 1.0 | 1.0 | 2535.0 | 25.0 | 50.0 | 77.0 | 24.0 | 1205069.0 | 1951533.0 | 2015.0 | 32.0 | 42.022646183 | -87.524773286 | 8603.0 |
sigma | 396787.564221 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.455083515588 | 0.35934414686 | 695.76029875 | 6.94547493301 | 13.6495661144 | 21.2748762223 | NaN | 16496.4493681 | 31274.0163199 | 0.0 | NaN | 0.0860186579358 | 0.0600357970653 | NaN |
zeros | 0 | 0 | 0 | 3 | 16 | 11 | 933 | 19 | 7071 | 8476 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 603 | 0 | 0 | 1 |
missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 162 | 0 | 0 | 0 | 162 | 162 | 0 | 0 | 162 | 162 | 162 |
0 | 9955810.0 | HY144797 | 02/08/2015 11:43:40 PM | 081XX S COLES AVE | 1811 | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | STREET | true | false | 422.0 | 4.0 | 7.0 | 46.0 | 18 | 1198273.0 | 1851626.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.747693646 | -87.549035389 | (41.747693646, -87.549035389) |
1 | 9955861.0 | HY144838 | 02/08/2015 11:41:42 PM | 118XX S STATE ST | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | true | true | 522.0 | 5.0 | 34.0 | 53.0 | 08B | 1178335.0 | 1826581.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.679442289 | -87.622850758 | (41.679442289, -87.622850758) |
2 | 9955801.0 | HY144779 | 02/08/2015 11:30:22 PM | 002XX S LARAMIE AVE | 2026 | NARCOTICS | POSS: PCP | SIDEWALK | true | false | 1522.0 | 15.0 | 29.0 | 25.0 | 18 | 1141717.0 | 1898581.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.87777333 | -87.755117993 | (41.87777333, -87.755117993) |
3 | 9956197.0 | HY144787 | 02/08/2015 11:30:23 PM | 006XX E 67TH ST | 1811 | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | STREET | true | false | 321.0 | nan | 6.0 | 42.0 | 18 | nan | nan | 2015.0 | 02/15/2015 12:43:39 PM | nan | nan | |
4 | 9955846.0 | HY144829 | 02/08/2015 11:30:58 PM | 0000X S MAYFIELD AVE | 0610 | BURGLARY | FORCIBLE ENTRY | APARTMENT | false | false | 1513.0 | 15.0 | 29.0 | 25.0 | 05 | 1137239.0 | 1899372.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.880025548 | -87.771541324 | (41.880025548, -87.771541324) |
5 | 9955835.0 | HY144778 | 02/08/2015 11:30:21 PM | 010XX W 48TH ST | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | false | true | 933.0 | 9.0 | 3.0 | 61.0 | 08B | 1169986.0 | 1873019.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.807059405 | -87.65206589 | (41.807059405, -87.65206589) |
6 | 9955872.0 | HY144822 | 02/08/2015 11:27:24 PM | 015XX W ARTHUR AVE | 1320 | CRIMINAL DAMAGE | TO VEHICLE | STREET | false | false | 2432.0 | 24.0 | 40.0 | 1.0 | 14 | 1164732.0 | 1943222.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.999814056 | -87.669342967 | (41.999814056, -87.669342967) |
7 | 21752.0 | HY144738 | 02/08/2015 11:26:12 PM | 060XX W GRAND AVE | 0110 | HOMICIDE | FIRST DEGREE MURDER | STREET | true | false | 2512.0 | 25.0 | 37.0 | 19.0 | 01A | 1135910.0 | 1914206.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.920755683 | -87.776067514 | (41.920755683, -87.776067514) |
8 | 9955808.0 | HY144775 | 02/08/2015 11:20:33 PM | 001XX W WACKER DR | 0460 | BATTERY | SIMPLE | OTHER | false | false | 122.0 | 1.0 | 42.0 | 32.0 | 08B | 1175384.0 | 1902088.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.886707818 | -87.631396356 | (41.886707818, -87.631396356) |
9 | 9958275.0 | HY146732 | 02/08/2015 11:15:36 PM | 001XX W WACKER DR | 0460 | BATTERY | SIMPLE | HOTEL/MOTEL | false | false | 122.0 | 1.0 | 42.0 | 32.0 | 08B | 1175384.0 | 1902088.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.886707818 | -87.631396356 | (41.886707818, -87.631396356) |
f_crimes["IUCR"].table()
IUCR | Count |
---|---|
0110 | 16 |
0261 | 2 |
0263 | 2 |
0265 | 5 |
0266 | 2 |
0281 | 41 |
0291 | 3 |
0312 | 18 |
0313 | 20 |
031A | 136 |
f_crimes["Arrest"].table()
Arrest | Count |
---|---|
false | 7071 |
true | 2928 |
col_names = map(lambda s: s.replace(' ', '_'), f_crimes.col_names)
f_crimes.set_names(col_names)
ID | Case_Number | Date | Block | IUCR | Primary_Type | Description | Location_Description | Arrest | Domestic | Beat | District | Ward | Community_Area | FBI_Code | X_Coordinate | Y_Coordinate | Year | Updated_On | Latitude | Longitude | Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9.95581e+06 | HY144797 | 02/08/2015 11:43:40 PM | 081XX S COLES AVE | 1811 | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | STREET | true | false | 422 | 4 | 7 | 46 | 18 | 1.19827e+06 | 1.85163e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.7477 | -87.549 | (41.747693646, -87.549035389) |
9.95586e+06 | HY144838 | 02/08/2015 11:41:42 PM | 118XX S STATE ST | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | true | true | 522 | 5 | 34 | 53 | 08B | 1.17834e+06 | 1.82658e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.6794 | -87.6229 | (41.679442289, -87.622850758) |
9.9558e+06 | HY144779 | 02/08/2015 11:30:22 PM | 002XX S LARAMIE AVE | 2026 | NARCOTICS | POSS: PCP | SIDEWALK | true | false | 1522 | 15 | 29 | 25 | 18 | 1.14172e+06 | 1.89858e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.8778 | -87.7551 | (41.87777333, -87.755117993) |
9.9562e+06 | HY144787 | 02/08/2015 11:30:23 PM | 006XX E 67TH ST | 1811 | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | STREET | true | false | 321 | nan | 6 | 42 | 18 | nan | nan | 2015 | 02/15/2015 12:43:39 PM | nan | nan | |
9.95585e+06 | HY144829 | 02/08/2015 11:30:58 PM | 0000X S MAYFIELD AVE | 0610 | BURGLARY | FORCIBLE ENTRY | APARTMENT | false | false | 1513 | 15 | 29 | 25 | 05 | 1.13724e+06 | 1.89937e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.88 | -87.7715 | (41.880025548, -87.771541324) |
9.95584e+06 | HY144778 | 02/08/2015 11:30:21 PM | 010XX W 48TH ST | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | false | true | 933 | 9 | 3 | 61 | 08B | 1.16999e+06 | 1.87302e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.8071 | -87.6521 | (41.807059405, -87.65206589) |
9.95587e+06 | HY144822 | 02/08/2015 11:27:24 PM | 015XX W ARTHUR AVE | 1320 | CRIMINAL DAMAGE | TO VEHICLE | STREET | false | false | 2432 | 24 | 40 | 1 | 14 | 1.16473e+06 | 1.94322e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.9998 | -87.6693 | (41.999814056, -87.669342967) |
21752 | HY144738 | 02/08/2015 11:26:12 PM | 060XX W GRAND AVE | 0110 | HOMICIDE | FIRST DEGREE MURDER | STREET | true | false | 2512 | 25 | 37 | 19 | 01A | 1.13591e+06 | 1.91421e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.9208 | -87.7761 | (41.920755683, -87.776067514) |
9.95581e+06 | HY144775 | 02/08/2015 11:20:33 PM | 001XX W WACKER DR | 0460 | BATTERY | SIMPLE | OTHER | false | false | 122 | 1 | 42 | 32 | 08B | 1.17538e+06 | 1.90209e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.8867 | -87.6314 | (41.886707818, -87.631396356) |
9.95828e+06 | HY146732 | 02/08/2015 11:15:36 PM | 001XX W WACKER DR | 0460 | BATTERY | SIMPLE | HOTEL/MOTEL | false | false | 122 | 1 | 42 | 32 | 08B | 1.17538e+06 | 1.90209e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.8867 | -87.6314 | (41.886707818, -87.631396356) |
h2o.set_timezone("Etc/UTC")
def refine_date_col(data, col, pattern):
    data[col] = data[col].as_date(pattern)
    data["Day"] = data[col].day()
    data["Month"] = data[col].month() + 1    # Since H2O indexes from 0
    data["Year"] = data[col].year()
    data["WeekNum"] = data[col].week()
    data["WeekDay"] = data[col].dayOfWeek()
    data["HourOfDay"] = data[col].hour()
    # Create weekend and season cols
    data["Weekend"] = ((data["WeekDay"] == "Sun") | (data["WeekDay"] == "Sat")).ifelse(1, 0)[0]
    data["Season"] = data["Month"].cut([0, 2, 5, 7, 10, 12], ["Winter", "Spring", "Summer", "Autumn", "Winter"])
refine_date_col(f_crimes, "Date", "%m/%d/%Y %I:%M:%S %p")
f_crimes = f_crimes.drop("Date")
f_census = h2o.import_file("../data/chicagoCensus.csv",header=1)
## Update column names in the table
col_names = map(lambda s: s.strip().replace(' ', '_'), f_census.col_names)
f_census.set_names(col_names)
f_census = f_census[1:78,:]
print(f_census.dim)
#f_census.summary()
Parse Progress: [##################################################] 100%
[77, 9]
f_weather = h2o.import_file("../data/chicagoAllWeather.csv")
f_weather = f_weather[1:]
print(f_weather.dim)
#f_weather.summary()
Parse Progress: [##################################################] 100%
[5162, 6]
f_weather[f_weather["meanTemp"].isna()]
month | day | year | maxTemp | meanTemp | minTemp |
---|---|---|---|---|---|
6 | 19 | 2008 | nan | nan | nan |
9 | 23 | 2008 | nan | nan | nan |
9 | 24 | 2008 | nan | nan | nan |
9 | 25 | 2008 | nan | nan | nan |
9 | 26 | 2008 | nan | nan | nan |
9 | 27 | 2008 | nan | nan | nan |
9 | 28 | 2008 | nan | nan | nan |
9 | 29 | 2008 | nan | nan | nan |
9 | 30 | 2008 | nan | nan | nan |
3 | 4 | 2009 | nan | nan | nan |
Use hc.as_spark_frame to convert an H2OFrame to a Spark DataFrame (and hc.as_h2o_frame for the reverse); append ? to inspect the docstring:
hc.as_spark_frame?
f_weather
H2OContext: ip=172.16.2.98, port=54329
month | day | year | maxTemp | meanTemp | minTemp |
---|---|---|---|---|---|
1 | 1 | 2001 | 23 | 14 | 6 |
1 | 2 | 2001 | 18 | 12 | 6 |
1 | 3 | 2001 | 28 | 18 | 8 |
1 | 4 | 2001 | 30 | 24 | 19 |
1 | 5 | 2001 | 36 | 30 | 21 |
1 | 6 | 2001 | 33 | 26 | 19 |
1 | 7 | 2001 | 34 | 28 | 21 |
1 | 8 | 2001 | 26 | 20 | 14 |
1 | 9 | 2001 | 23 | 16 | 10 |
1 | 10 | 2001 | 34 | 26 | 19 |
df_weather = hc.as_spark_frame(f_weather,)
df_census = hc.as_spark_frame(f_census)
df_crimes = hc.as_spark_frame(f_crimes)
df_weather.show(2)
+-----+---+----+-------+--------+-------+
|month|day|year|maxTemp|meanTemp|minTemp|
+-----+---+----+-------+--------+-------+
| 1| 1|2001| 23| 14| 6|
| 1| 2|2001| 18| 12| 6|
+-----+---+----+-------+--------+-------+
## Register DataFrames as tables in SQL context
sqlContext.registerDataFrameAsTable(df_weather, "chicagoWeather")
sqlContext.registerDataFrameAsTable(df_census, "chicagoCensus")
sqlContext.registerDataFrameAsTable(df_crimes, "chicagoCrime")
crimeWithWeather = sqlContext.sql("""SELECT
a.Year, a.Month, a.Day, a.WeekNum, a.HourOfDay, a.Weekend, a.Season, a.WeekDay,
a.IUCR, a.Primary_Type, a.Location_Description, a.Community_Area, a.District,
a.Arrest, a.Domestic, a.Beat, a.Ward, a.FBI_Code,
b.minTemp, b.maxTemp, b.meanTemp,
c.PERCENT_AGED_UNDER_18_OR_OVER_64, c.PER_CAPITA_INCOME, c.HARDSHIP_INDEX,
c.PERCENT_OF_HOUSING_CROWDED, c.PERCENT_HOUSEHOLDS_BELOW_POVERTY,
c.PERCENT_AGED_16__UNEMPLOYED, c.PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA
FROM chicagoCrime a
JOIN chicagoWeather b
ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day
JOIN chicagoCensus c
ON a.Community_Area = c.Community_Area_Number""")
The crimeWithWeather data table from Spark:
crimeWithWeather.show(2)
+----+-----+---+-------+---------+-------+------+-------+----+-----------------+--------------------+--------------+--------+------+--------+----+----+--------+-------+-------+--------+--------------------------------+-----------------+--------------+--------------------------+--------------------------------+---------------------------+--------------------------------------------+
|Year|Month|Day|WeekNum|HourOfDay|Weekend|Season|WeekDay|IUCR| Primary_Type|Location_Description|Community_Area|District|Arrest|Domestic|Beat|Ward|FBI_Code|minTemp|maxTemp|meanTemp|PERCENT_AGED_UNDER_18_OR_OVER_64|PER_CAPITA_INCOME|HARDSHIP_INDEX|PERCENT_OF_HOUSING_CROWDED|PERCENT_HOUSEHOLDS_BELOW_POVERTY|PERCENT_AGED_16__UNEMPLOYED|PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA|
+----+-----+---+-------+---------+-------+------+-------+----+-----------------+--------------------+--------------+--------+------+--------+----+----+--------+-------+-------+--------+--------------------------------+-----------------+--------------+--------------------------+--------------------------------+---------------------------+--------------------------------------------+
|2015| 1| 23| 4| 22| 0|Winter| Fri|143A|WEAPONS VIOLATION| ALLEY| 31| 12| true| false|1234| 25| 15| 29| 31| 30| 32.6| 16444| 76| 9.600000000000001| 25.8| 15.8| 40.7|
|2015| 1| 23| 4| 19| 0|Winter| Fri|4625| OTHER OFFENSE| SIDEWALK| 31| 10| true| false|1034| 25| 26| 29| 31| 30| 32.6| 16444| 76| 9.600000000000001| 25.8| 15.8| 40.7|
+----+-----+---+-------+---------+-------+------+-------+----+-----------------+--------------------+--------------+--------+------+--------+----+----+--------+-------+-------+--------+--------------------------------+-----------------+--------------+--------------------------+--------------------------------+---------------------------+--------------------------------------------+
only showing top 2 rows
hc.as_h2o_frame?
crimeWithWeatherHF = hc.as_h2o_frame(crimeWithWeather,framename="crimeWithWeather")
H2OContext: ip=172.16.2.98, port=54329
crimeWithWeatherHF.summary()
Year | Month | Day | WeekNum | HourOfDay | Weekend | Season | WeekDay | IUCR | Primary_Type | Location_Description | Community_Area | District | Arrest | Domestic | Beat | Ward | FBI_Code | minTemp | maxTemp | meanTemp | PERCENT_AGED_UNDER_18_OR_OVER_64 | PER_CAPITA_INCOME | HARDSHIP_INDEX | PERCENT_OF_HOUSING_CROWDED | PERCENT_HOUSEHOLDS_BELOW_POVERTY | PERCENT_AGED_16__UNEMPLOYED | PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
type | int | int | int | int | int | int | string | string | string | string | string | int | int | string | string | int | int | string | int | int | int | real | int | int | real | real | real | real |
mins | 2015.0 | 1.0 | 1.0 | 4.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | 111.0 | 1.0 | NaN | -2.0 | 15.0 | 7.0 | 13.5 | 8201.0 | 1.0 | 0.3 | 3.3 | 4.7 | 2.5 |
mean | 2015.0 | 1.41944194419 | 17.6839683968 | 5.18081808181 | 13.6319631963 | 0.159115911591 | NaN | NaN | NaN | NaN | NaN | 37.4476447645 | 11.3489885128 | NaN | NaN | 1159.61806181 | 22.9540954095 | NaN | 17.699669967 | 31.7199719972 | 24.9408940894 | 35.0596759676 | 25221.3057306 | 54.4786478648 | 5.43707370737 | 24.600750075 | 16.8288328833 | 21.096639664 |
maxs | 2015.0 | 2.0 | 31.0 | 6.0 | 23.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | 77.0 | 25.0 | NaN | NaN | 2535.0 | 50.0 | NaN | 29.0 | 43.0 | 36.0 | 51.5 | 88669.0 | 98.0 | 15.8 | 56.5 | 35.9 | 54.8 |
sigma | 0.0 | 0.493492406787 | 11.1801043358 | 0.738929830409 | 6.47321735807 | 0.365802434041 | NaN | NaN | NaN | NaN | NaN | 21.2748762223 | 6.94547493301 | NaN | NaN | 695.76029875 | 13.6495661144 | NaN | 8.96118136438 | 6.93809913472 | 7.46302527062 | 7.95653388237 | 18010.0446225 | 29.3247456472 | 3.75289588494 | 10.1450570661 | 7.58926327988 | 11.3868817911 |
zeros | 0 | 0 | 0 | 0 | 374 | 8408 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 162 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 2015.0 | 1.0 | 24.0 | 4.0 | 22.0 | 0.0 | Winter | Sat | 2820 | OTHER OFFENSE | APARTMENT | 31.0 | 10.0 | false | false | 1034.0 | 25.0 | 26 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
1 | 2015.0 | 1.0 | 24.0 | 4.0 | 21.0 | 0.0 | Winter | Sat | 1310 | CRIMINAL DAMAGE | RESTAURANT | 31.0 | 12.0 | true | false | 1233.0 | 25.0 | 14 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
2 | 2015.0 | 1.0 | 24.0 | 4.0 | 18.0 | 0.0 | Winter | Sat | 1750 | OFFENSE INVOLVING CHILDREN | RESIDENCE | 31.0 | 12.0 | false | true | 1235.0 | 25.0 | 20 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
3 | 2015.0 | 1.0 | 24.0 | 4.0 | 18.0 | 0.0 | Winter | Sat | 0460 | BATTERY | OTHER | 31.0 | 10.0 | false | false | 1023.0 | 25.0 | 08B | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
4 | 2015.0 | 1.0 | 24.0 | 4.0 | 13.0 | 0.0 | Winter | Sat | 0890 | THEFT | CURRENCY EXCHANGE | 31.0 | 10.0 | false | false | 1023.0 | 25.0 | 06 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
5 | 2015.0 | 1.0 | 24.0 | 4.0 | 9.0 | 0.0 | Winter | Sat | 0560 | ASSAULT | OTHER | 31.0 | 12.0 | false | false | 1234.0 | 25.0 | 08A | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
6 | 2015.0 | 1.0 | 24.0 | 4.0 | 8.0 | 0.0 | Winter | Sat | 0486 | BATTERY | RESIDENCE | 31.0 | 12.0 | true | true | 1235.0 | 25.0 | 08B | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
7 | 2015.0 | 1.0 | 24.0 | 4.0 | 1.0 | 0.0 | Winter | Sat | 0420 | BATTERY | SIDEWALK | 31.0 | 10.0 | false | false | 1034.0 | 25.0 | 04B | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
8 | 2015.0 | 1.0 | 24.0 | 4.0 | 0.0 | 0.0 | Winter | Sat | 1320 | CRIMINAL DAMAGE | PARKING LOT/GARAGE(NON.RESID.) | 31.0 | 9.0 | false | false | 912.0 | 11.0 | 14 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
9 | 2015.0 | 1.0 | 31.0 | 5.0 | 23.0 | 0.0 | Winter | Sat | 0820 | THEFT | SIDEWALK | 31.0 | 12.0 | false | false | 1234.0 | 25.0 | 06 | 19.0 | 36.0 | 28.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
The crimeWithWeatherHF data table in H2O. Convert the string columns to categorical (enum) factors:
crimeWithWeatherHF["Season"] = crimeWithWeatherHF["Season"].asfactor()
crimeWithWeatherHF["WeekDay"]= crimeWithWeatherHF["WeekDay"].asfactor()
crimeWithWeatherHF["IUCR"]= crimeWithWeatherHF["IUCR"].asfactor()
crimeWithWeatherHF["Primary_Type"]= crimeWithWeatherHF["Primary_Type"].asfactor()
crimeWithWeatherHF["Location_Description"]= crimeWithWeatherHF["Location_Description"].asfactor()
crimeWithWeatherHF["Arrest"]= crimeWithWeatherHF["Arrest"].asfactor()
crimeWithWeatherHF["Domestic"]= crimeWithWeatherHF["Domestic"].asfactor()
crimeWithWeatherHF["FBI_Code"]= crimeWithWeatherHF["FBI_Code"].asfactor()
crimeWithWeatherHF["Season"]= crimeWithWeatherHF["Season"].asfactor()
crimeWithWeatherHF.summary()
Year | Month | Day | WeekNum | HourOfDay | Weekend | Season | WeekDay | IUCR | Primary_Type | Location_Description | Community_Area | District | Arrest | Domestic | Beat | Ward | FBI_Code | minTemp | maxTemp | meanTemp | PERCENT_AGED_UNDER_18_OR_OVER_64 | PER_CAPITA_INCOME | HARDSHIP_INDEX | PERCENT_OF_HOUSING_CROWDED | PERCENT_HOUSEHOLDS_BELOW_POVERTY | PERCENT_AGED_16__UNEMPLOYED | PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
type | int | int | int | int | int | int | enum | enum | enum | enum | enum | int | int | enum | enum | int | int | enum | int | int | int | real | int | int | real | real | real | real |
mins | 2015.0 | 1.0 | 1.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 111.0 | 1.0 | 0.0 | -2.0 | 15.0 | 7.0 | 13.5 | 8201.0 | 1.0 | 0.3 | 3.3 | 4.7 | 2.5 |
mean | 2015.0 | 1.41944194419 | 17.6839683968 | 5.18081808181 | 13.6319631963 | 0.159115911591 | 0.0 | NaN | NaN | NaN | NaN | 37.4476447645 | 11.3489885128 | 0.292829282928 | 0.152315231523 | 1159.61806181 | 22.9540954095 | NaN | 17.699669967 | 31.7199719972 | 24.9408940894 | 35.0596759676 | 25221.3057306 | 54.4786478648 | 5.43707370737 | 24.600750075 | 16.8288328833 | 21.096639664 |
maxs | 2015.0 | 2.0 | 31.0 | 6.0 | 23.0 | 1.0 | 0.0 | 6.0 | 212.0 | 26.0 | 90.0 | 77.0 | 25.0 | 1.0 | 1.0 | 2535.0 | 50.0 | 24.0 | 29.0 | 43.0 | 36.0 | 51.5 | 88669.0 | 98.0 | 15.8 | 56.5 | 35.9 | 54.8 |
sigma | 0.0 | 0.493492406787 | 11.1801043358 | 0.738929830409 | 6.47321735807 | 0.365802434041 | 0.0 | NaN | NaN | NaN | NaN | 21.2748762223 | 6.94547493301 | 0.455083515588 | 0.35934414686 | 695.76029875 | 13.6495661144 | NaN | 8.96118136438 | 6.93809913472 | 7.46302527062 | 7.95653388237 | 18010.0446225 | 29.3247456472 | 3.75289588494 | 10.1450570661 | 7.58926327988 | 11.3868817911 |
zeros | 0 | 0 | 0 | 0 | 374 | 8408 | 9999 | 1942 | 16 | 11 | 19 | 0 | 0 | 7071 | 8476 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 162 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 2015.0 | 1.0 | 24.0 | 4.0 | 22.0 | 0.0 | Winter | Sat | 2820 | OTHER OFFENSE | APARTMENT | 31.0 | 10.0 | false | false | 1034.0 | 25.0 | 26 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
1 | 2015.0 | 1.0 | 24.0 | 4.0 | 21.0 | 0.0 | Winter | Sat | 1310 | CRIMINAL DAMAGE | RESTAURANT | 31.0 | 12.0 | true | false | 1233.0 | 25.0 | 14 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
2 | 2015.0 | 1.0 | 24.0 | 4.0 | 18.0 | 0.0 | Winter | Sat | 1750 | OFFENSE INVOLVING CHILDREN | RESIDENCE | 31.0 | 12.0 | false | true | 1235.0 | 25.0 | 20 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
3 | 2015.0 | 1.0 | 24.0 | 4.0 | 18.0 | 0.0 | Winter | Sat | 0460 | BATTERY | OTHER | 31.0 | 10.0 | false | false | 1023.0 | 25.0 | 08B | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
4 | 2015.0 | 1.0 | 24.0 | 4.0 | 13.0 | 0.0 | Winter | Sat | 0890 | THEFT | CURRENCY EXCHANGE | 31.0 | 10.0 | false | false | 1023.0 | 25.0 | 06 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
5 | 2015.0 | 1.0 | 24.0 | 4.0 | 9.0 | 0.0 | Winter | Sat | 0560 | ASSAULT | OTHER | 31.0 | 12.0 | false | false | 1234.0 | 25.0 | 08A | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
6 | 2015.0 | 1.0 | 24.0 | 4.0 | 8.0 | 0.0 | Winter | Sat | 0486 | BATTERY | RESIDENCE | 31.0 | 12.0 | true | true | 1235.0 | 25.0 | 08B | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
7 | 2015.0 | 1.0 | 24.0 | 4.0 | 1.0 | 0.0 | Winter | Sat | 0420 | BATTERY | SIDEWALK | 31.0 | 10.0 | false | false | 1034.0 | 25.0 | 04B | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
8 | 2015.0 | 1.0 | 24.0 | 4.0 | 0.0 | 0.0 | Winter | Sat | 1320 | CRIMINAL DAMAGE | PARKING LOT/GARAGE(NON.RESID.) | 31.0 | 9.0 | false | false | 912.0 | 11.0 | 14 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
9 | 2015.0 | 1.0 | 31.0 | 5.0 | 23.0 | 0.0 | Winter | Sat | 0820 | THEFT | SIDEWALK | 31.0 | 12.0 | false | false | 1234.0 | 25.0 | 06 | 19.0 | 36.0 | 28.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
ratios = [0.6,0.2]
frs = crimeWithWeatherHF.split_frame(ratios,seed=12345)
train = frs[0]
train.frame_id = "Train"
valid = frs[2]
valid.frame_id = "Validation"
test = frs[1]
test.frame_id = "Test"
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
H2OGradientBoostingEstimator?
predictors = crimeWithWeatherHF.names[:]
response = "Arrest"
predictors.remove(response)
model_gbm = H2OGradientBoostingEstimator(ntrees=50,
                                         max_depth=6,
                                         learn_rate=0.1,
                                         # nfolds=2,
                                         distribution="bernoulli")
model_gbm.train(x=predictors,
                y="Arrest",
                training_frame=train,
                validation_frame=valid)
model_dl = H2ODeepLearningEstimator(variable_importances=True,
                                    loss="Automatic")
model_dl.train(x=predictors,
               y="Arrest",
               training_frame=train,
               validation_frame=valid)
gbm Model Build Progress: [##################################################] 100%
deeplearning Model Build Progress: [##################################################] 100%
print(model_gbm.confusion_matrix(train = True))
print(model_gbm.confusion_matrix(valid = True))
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.335827722991 (training data):
false | true | Error | Rate | |
false | 4125.0 | 142.0 | 0.0333 | (142.0/4267.0) |
true | 251.0 | 1504.0 | 0.143 | (251.0/1755.0) |
Total | 4376.0 | 1646.0 | 0.0653 | (393.0/6022.0) |
Confusion Matrix (Act/Pred) for max f1 (validation data):
false | true | Error | Rate | |
false | 1362.0 | 61.0 | 0.0429 | (61.0/1423.0) |
true | 150.0 | 443.0 | 0.253 | (150.0/593.0) |
Total | 1512.0 | 504.0 | 0.1047 | (211.0/2016.0) |
print(model_gbm.auc(train=True))
print(model_gbm.auc(valid=True))
model_gbm.plot(metric="AUC")
0.974667176776
0.92596751276
model_gbm.varimp(True)
variable | relative_importance | scaled_importance | percentage | |
---|---|---|---|---|
0 | IUCR | 4280.939453 | 1.000000e+00 | 8.234218e-01 |
1 | Location_Description | 487.323059 | 1.138355e-01 | 9.373466e-02 |
2 | WeekDay | 55.790558 | 1.303232e-02 | 1.073109e-02 |
3 | HourOfDay | 55.419220 | 1.294557e-02 | 1.065967e-02 |
4 | PERCENT_AGED_16__UNEMPLOYED | 34.422894 | 8.040967e-03 | 6.621107e-03 |
5 | Beat | 31.468222 | 7.350775e-03 | 6.052788e-03 |
6 | PERCENT_HOUSEHOLDS_BELOW_POVERTY | 29.103352 | 6.798356e-03 | 5.597915e-03 |
7 | PER_CAPITA_INCOME | 26.233143 | 6.127894e-03 | 5.045841e-03 |
8 | PERCENT_AGED_UNDER_18_OR_OVER_64 | 24.077402 | 5.624327e-03 | 4.631193e-03 |
9 | Day | 23.472567 | 5.483041e-03 | 4.514855e-03 |
... | ... | ... | ... | ... |
15 | maxTemp | 11.300793 | 2.639793e-03 | 2.173663e-03 |
16 | Community_Area | 10.252146 | 2.394835e-03 | 1.971960e-03 |
17 | HARDSHIP_INDEX | 10.116072 | 2.363049e-03 | 1.945786e-03 |
18 | Domestic | 9.294327 | 2.171095e-03 | 1.787727e-03 |
19 | District | 8.304654 | 1.939914e-03 | 1.597367e-03 |
20 | minTemp | 6.243027 | 1.458331e-03 | 1.200822e-03 |
21 | WeekNum | 4.230102 | 9.881246e-04 | 8.136433e-04 |
22 | FBI_Code | 2.363182 | 5.520241e-04 | 4.545486e-04 |
23 | Month | 0.000018 | 4.187325e-09 | 3.447935e-09 |
24 | Weekend | 0.000000 | 0.000000e+00 | 0.000000e+00 |
model_dl
Model Details
=============
H2ODeepLearningEstimator : Deep Learning
Model Key: DeepLearning_model_python_1446861372065_4
Status of Neuron Layers: predicting Arrest, 2-class classification, bernoulli distribution, CrossEntropy loss, 118,802 weights/biases, 1.4 MB, 72,478 training samples, mini-batch size 1
layer | units | type | dropout | l1 | l2 | mean_rate | rate_RMS | momentum | mean_weight | weight_RMS | mean_bias | bias_RMS | |
1 | 390 | Input | 0.0 | ||||||||||
2 | 200 | Rectifier | 0.0 | 0.0 | 0.0 | 0.1 | 0.3 | 0.0 | -0.0 | 0.1 | -0.0 | 0.1 | |
3 | 200 | Rectifier | 0.0 | 0.0 | 0.0 | 0.1 | 0.2 | 0.0 | -0.0 | 0.1 | 0.8 | 0.2 | |
4 | 2 | Softmax | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.4 | -0.0 | 0.0 |
ModelMetricsBinomial: deeplearning
** Reported on train data. **
MSE: 0.0737426129728
R^2: 0.642891439669
LogLoss: 0.242051500943
AUC: 0.950131166302
Gini: 0.900262332604
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.343997370612:
false | true | Error | Rate | |
false | 4003.0 | 264.0 | 0.0619 | (264.0/4267.0) |
true | 358.0 | 1397.0 | 0.204 | (358.0/1755.0) |
Total | 4361.0 | 1661.0 | 0.1033 | (622.0/6022.0) |
metric | threshold | value | idx |
max f1 | 0.3 | 0.8 | 195.0 |
max f2 | 0.2 | 0.9 | 278.0 |
max f0point5 | 0.7 | 0.9 | 86.0 |
max accuracy | 0.5 | 0.9 | 149.0 |
max precision | 1.0 | 1.0 | 0.0 |
max absolute_MCC | 0.3 | 0.7 | 195.0 |
max min_per_class_accuracy | 0.2 | 0.9 | 247.0 |
ModelMetricsBinomial: deeplearning
** Reported on validation data. **
MSE: 0.0843305429737
R^2: 0.593831388139
LogLoss: 0.280203809486
AUC: 0.930515181213
Gini: 0.861030362427
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.493462351545:
false | true | Error | Rate | |
false | 1361.0 | 62.0 | 0.0436 | (62.0/1423.0) |
true | 158.0 | 435.0 | 0.2664 | (158.0/593.0) |
Total | 1519.0 | 497.0 | 0.1091 | (220.0/2016.0) |
metric | threshold | value | idx |
max f1 | 0.5 | 0.8 | 137.0 |
max f2 | 0.1 | 0.8 | 303.0 |
max f0point5 | 0.7 | 0.9 | 82.0 |
max accuracy | 0.7 | 0.9 | 91.0 |
max precision | 1.0 | 1.0 | 0.0 |
max absolute_MCC | 0.7 | 0.7 | 91.0 |
max min_per_class_accuracy | 0.2 | 0.8 | 236.0 |
timestamp | duration | training_speed | epochs | samples | training_MSE | training_r2 | training_logloss | training_AUC | training_classification_error | validation_MSE | validation_r2 | validation_logloss | validation_AUC | validation_classification_error | |
2015-11-06 17:57:05 | 0.000 sec | None | 0.0 | 0.0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | |
2015-11-06 17:57:09 | 2.899 sec | 2594 rows/sec | 1.0 | 6068.0 | 0.1 | 0.3 | 0.6 | 0.9 | 0.1 | 0.1 | 0.3 | 0.6 | 0.9 | 0.1 | |
2015-11-06 17:57:15 | 9.096 sec | 5465 rows/sec | 7.3 | 43742.0 | 0.1 | 0.6 | 0.3 | 0.9 | 0.1 | 0.1 | 0.6 | 0.3 | 0.9 | 0.1 | |
2015-11-06 17:57:19 | 12.425 sec | 6571 rows/sec | 12.0 | 72478.0 | 0.1 | 0.6 | 0.2 | 1.0 | 0.1 | 0.1 | 0.6 | 0.3 | 0.9 | 0.1 |
variable | relative_importance | scaled_importance | percentage |
Domestic.false | 1.0 | 1.0 | 0.0 |
Primary_Type.NARCOTICS | 0.9 | 0.9 | 0.0 |
IUCR.0860 | 0.8 | 0.8 | 0.0 |
FBI_Code.18 | 0.8 | 0.8 | 0.0 |
IUCR.4625 | 0.7 | 0.7 | 0.0 |
--- | --- | --- | --- |
Location_Description.missing(NA) | 0.0 | 0.0 | 0.0 |
Primary_Type.missing(NA) | 0.0 | 0.0 | 0.0 |
FBI_Code.missing(NA) | 0.0 | 0.0 | 0.0 |
WeekDay.missing(NA) | 0.0 | 0.0 | 0.0 |
Domestic.missing(NA) | 0.0 | 0.0 | 0.0 |
predictions = model_gbm.predict(test)
predictions.show()
predict | false | true |
---|---|---|
false | 0.946415 | 0.0535847 |
false | 0.862165 | 0.137835 |
false | 0.938661 | 0.0613392 |
false | 0.870186 | 0.129814 |
false | 0.980488 | 0.0195118 |
false | 0.972006 | 0.0279937 |
false | 0.990995 | 0.00900489 |
true | 0.0210692 | 0.978931 |
false | 0.693061 | 0.306939 |
false | 0.992097 | 0.00790253 |
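Because this is a PySparkling workflow, the prediction frame can also be published back to Spark. A minimal sketch, assuming the H2OContext hc created earlier exposes as_spark_frame (the counterpart of the as_h2o_frame call used later in this tutorial):
# Convert the H2O prediction frame back into a Spark DataFrame (sketch)
predictions_df = hc.as_spark_frame(predictions)
predictions_df.show(5)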
test_performance = model_gbm.model_performance(test)
test_performance
ModelMetricsBinomial: gbm
** Reported on test data. **
MSE: 0.0893676876445
R^2: 0.57094394422
LogLoss: 0.294019576922
AUC: 0.922152238508
Gini: 0.844304477016
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.365461652105:
false | true | Error | Rate | |
false | 1297.0 | 84.0 | 0.0608 | (84.0/1381.0) |
true | 153.0 | 427.0 | 0.2638 | (153.0/580.0) |
Total | 1450.0 | 511.0 | 0.1209 | (237.0/1961.0) |
metric | threshold | value | idx |
max f1 | 0.4 | 0.8 | 158.0 |
max f2 | 0.1 | 0.8 | 295.0 |
max f0point5 | 0.7 | 0.9 | 97.0 |
max accuracy | 0.6 | 0.9 | 112.0 |
max precision | 1.0 | 1.0 | 0.0 |
max absolute_MCC | 0.6 | 0.7 | 112.0 |
max min_per_class_accuracy | 0.2 | 0.8 | 235.0 |
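For a side-by-side comparison that the original output does not include, the Deep Learning model can be scored on the same test frame:
# Score the Deep Learning model on the held-out test set and print its AUC
test_performance_dl = model_dl.model_performance(test)
print(test_performance_dl.auc())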
# Create a table reporting the crime type, the arrest count per crime type,
# and the total reported count per crime type
sqlContext.registerDataFrameAsTable(df_crimes, "df_crimes")
allCrimes = sqlContext.sql("SELECT Primary_Type, count(*) as all_count FROM df_crimes GROUP BY Primary_Type")
crimesWithArrest = sqlContext.sql("SELECT Primary_Type, count(*) as crime_count FROM df_crimes WHERE Arrest = 'true' GROUP BY Primary_Type")
sqlContext.registerDataFrameAsTable(crimesWithArrest, "crimesWithArrest")
sqlContext.registerDataFrameAsTable(allCrimes, "allCrimes")
crime_type = sqlContext.sql("SELECT a.Primary_Type as Crime_Type, a.crime_count, b.all_count \
                             FROM crimesWithArrest a \
                             JOIN allCrimes b \
                             ON a.Primary_Type = b.Primary_Type")
crime_type.show(12)
+--------------------+-----------+---------+
| Crime_Type|crime_count|all_count|
+--------------------+-----------+---------+
| OTHER OFFENSE| 183| 720|
| WEAPONS VIOLATION| 96| 118|
| DECEPTIVE PRACTICE| 25| 445|
| BURGLARY| 14| 458|
| BATTERY| 432| 1851|
| ROBBERY| 17| 357|
| MOTOR VEHICLE THEFT| 17| 414|
| PROSTITUTION| 106| 106|
| CRIMINAL DAMAGE| 76| 1003|
| KIDNAPPING| 1| 7|
| GAMBLING| 3| 3|
|LIQUOR LAW VIOLATION| 12| 12|
+--------------------+-----------+---------+
only showing top 12 rows
crime_typeHF = hc.as_h2o_frame(crime_type,framename="crime_type")
crime_typeHF["Arrest_rate"] = crime_typeHF["crime_count"]/crime_typeHF["all_count"]
crime_typeHF["Crime_proportion"] = crime_typeHF["all_count"]/crime_typeHF["all_count"].sum()
crime_typeHF["Crime_Type"] = crime_typeHF["Crime_Type"].asfactor()
# h2o.assign(crime_typeHF,crime_type)
crime_typeHF.frame_id = "Crime_type"
crime_typeHF
Crime_Type | crime_count | all_count | Arrest_rate | Crime_proportion |
---|---|---|---|---|
OTHER OFFENSE | 183 | 720 | 0.254167 | 0.0721226 |
WEAPONS VIOLATION | 96 | 118 | 0.813559 | 0.0118201 |
DECEPTIVE PRACTICE | 25 | 445 | 0.0561798 | 0.0445758 |
BURGLARY | 14 | 458 | 0.0305677 | 0.045878 |
BATTERY | 432 | 1851 | 0.233387 | 0.185415 |
ROBBERY | 17 | 357 | 0.047619 | 0.0357608 |
MOTOR VEHICLE THEFT | 17 | 414 | 0.0410628 | 0.0414705 |
PROSTITUTION | 106 | 106 | 1 | 0.0106181 |
CRIMINAL DAMAGE | 76 | 1003 | 0.0757727 | 0.100471 |
KIDNAPPING | 1 | 7 | 0.142857 | 0.000701192 |
hc
H2OContext: ip=172.16.2.98, port=54329
The following snippet is a custom plot for H2O Flow (CoffeeScript) that overlays the arrest rate (blue) and the overall crime proportion (red) for each crime type, reading from the published "Crime_type" frame:
plot (g) -> g(
g.rect(
g.position "Crime_Type", "Arrest_rate"
g.fillColor g.value 'blue'
g.fillOpacity g.value 0.75
)
g.rect(
g.position "Crime_Type", "Crime_proportion"
g.fillColor g.value 'red'
g.fillOpacity g.value 0.65
)
g.from inspect "data", getFrame "Crime_type"
)
#hc.stop()
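If you prefer to draw the same comparison from Python instead of Flow, here is a minimal sketch assuming pandas and matplotlib are installed (neither is required by the original tutorial):
import matplotlib.pyplot as plt

# Pull the small summary frame into pandas and draw grouped bars for
# arrest rate (blue) and overall crime proportion (red) per crime type
ct = crime_typeHF.as_data_frame(use_pandas=True)
ax = ct.plot(x="Crime_Type", y=["Arrest_rate", "Crime_proportion"],
             kind="bar", color=["blue", "red"], alpha=0.75)
ax.set_ylabel("Rate / proportion")
plt.tight_layout()
plt.show()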